Overlap communication and computation in CUDA cell_loop [WIP] #8882
Conversation
This looks excellent from an execution point of view. (Obviously, we need to think about what to do with the existing functionality.)
bool flag = false;
for (unsigned int i = 0; i < GeometryInfo<dim>::vertices_per_cell; i++)
  if (cell->vertex_index(i))
This is wrong ;) I have to replace it with if(ghost_vertices[cell->vertex_index (i)])
Fixing this bug breaks step-64 :(
I like the idea. I quickly went over the code and my understanding is that …
@peterrum I'd like to get this merged before the release. What is left to do? I think that the …
I would prefer it to be like in the case of the CPU version:
so that there would still be the option to turn the overlap of computation and communication on and off. As to the current state: I had some issues with the code which I posted here (which should be the same as here https://github.com/peterrum/dealii/blob/1278ba52648fc3e7949d376795133650e6aa47b2/include/deal.II/matrix_free/cuda_matrix_free.templates.h#L1086-L1116). I don't recall exactly what the problem was, but the result was wrong under certain configurations. After some debugging, I limited the size of each color (see https://github.com/peterrum/dealii/blob/85bd6e8cacd59b949920a161dfad8588faabb9c5/include/deal.II/matrix_free/cuda_matrix_free.templates.h#L1107-L1146). This worked better, but I am not 100% sure that this fixed the problem. This is when my account at Piz Daint expired... My guess is that the code is 95% ready but there is a (minor) bug somewhere. Currently, I cannot work on this PR. However, if you have the time and resources to work on this PR, I would be happy to help and discuss aspects!
Do you remember what kind of configurations? Did it depend on the number of GPUs? The size of the problem?
I would say it was related to the number of refinements (for multiple GPUs - I think more than 2).
I have been using this code https://github.com/peterrum/deal-and-ceed-on-gpu/tree/fix.
Great, thanks.
dst.compress(VectorOperation::add);
src.zero_out_ghosts();
src.update_ghost_values_finish();
internal::apply_kernel_shmem<dim, Number, Functor>
I think this kernel should be preceded and followed by cudaDeviceSynchronize() because of CUDA-aware MPI. How do you know that the data exchanged during the update_ghost/compress has been computed by the time the computation is done? If you don't use CUDA-aware MPI, we know that everything is using the same stream, so it implicitly synchronizes, but we don't know when using CUDA-aware MPI. This also explains the problem you saw: you have a race condition, and using more colors changes the time required to loop over the domain.
@Rombur Thanks for the debugging!
I am wondering which version of the code you are using: are you using this version or the version on https://github.com/peterrum/dealii/tree/dealii-on-gpu? I would think 369a5b9 is working. However, what we observed at the workshop was that the memcopies DtoH and HtoD were not overlapped, which is why we introduced the streams in 7598802. I am afraid here we indeed need the synchronizations (before and after the second block - although I hoped that it is implicitly synchronized by the default stream).
I found the bug. You can't see it if you use CUDA-aware MPI but funny enough you can see it in serial if you use regular MPI. In that case, we copy all the data to host, perform the communication, and then copy all the data back to the device. This means that if we change the data on the device, it is overwritten by the data from the host.
Does this mean that the HtoD in update_ghost_values_finish() overwrites values that have been computed in the first block? Do I understand it correctly? But how is that possible? Only the ghost values are copied around, and that should be covered by the coloring (if it is correct), shouldn't it?
No, actually the problem is here in compress_finish. The values computed in block 2 are not taken into account.
So the problem is that when you don't use CUDA-aware MPI, the function import_from_ghosted_array_finish uses the data that has been copied to the host during the call to compress_start. If you update the data on the device after the call to compress_start, the host is not updated and you lose the new values. I need to double-check, but I think the solution is to copy the ghost data to the device, then do the call to import_from_ghosted_array_finish, and change import_from_ghosted_array_finish so that there is one CUDA path and one host path regardless of CUDA-aware MPI.
This is the if defined DEAL_II_COMPILER_CUDA_AWARE && !defined DEAL_II_MPI_WITH_CUDA_SUPPORT branch. In this configuration the whole vector is copied from and to the host. But that is not the case for if defined DEAL_II_COMPILER_CUDA_AWARE && defined DEAL_II_MPI_WITH_CUDA_SUPPORT, is it?
I think what I wanted to do here was not to write #ifdef DEAL_II_COMPILER_CUDA_AWARE but #ifdef DEAL_II_MPI_WITH_CUDA_SUPPORT. Sorry about this. I guess I did not notice this since I was building CUDA-aware at that time. Overlapping of computation and communication only makes sense for CUDA-aware MPI...
The whole purpose of the CUDA-aware MPI implementation is not to copy the whole vector.
This is the if defined DEAL_II_COMPILER_CUDA_AWARE && !defined DEAL_II_MPI_WITH_CUDA_SUPPORT branch. In this configuration the whole vector is copied from and to the host. But that is not the case for if defined DEAL_II_COMPILER_CUDA_AWARE && defined DEAL_II_MPI_WITH_CUDA_SUPPORT, is it?
That's right.
That's right.
I noticed your message too late...
This PR enables the overlap of communication and computation in CUDAWrapper::MatrixFree::cell_loop(). Sorry for yet another [WIP] PR; however, I would like to share the changes and get feedback already now. I have somewhat re/misused the coloring scheme ;)
ping @kronbichler @Rombur @masterleinad