
Overlap communication and computation in CUDA cell_loop [WIP] #8882

Conversation

@peterrum (Member) commented Oct 1, 2019

This PR enables the overlap of communication and computation in CUDAWrappers::MatrixFree::cell_loop().

Sorry for yet another [WIP] PR, but I would like to share the changes and get feedback early. I have somewhat re-/misused the coloring scheme ;)
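The intended pattern, as a minimal sketch (run_cell_kernel, interior_cells, and ghosted_cells are hypothetical names; the actual implementation reuses the coloring infrastructure mentioned above):

    // Overlap the ghost exchange of 'src' with work on cells that do not
    // touch ghost entries; the two cell partitions are hypothetical.
    src.update_ghost_values_start();            // kick off the ghost exchange
    run_cell_kernel(interior_cells, src, dst);  // cells needing no ghost data
    src.update_ghost_values_finish();           // wait for the exchange
    run_cell_kernel(ghosted_cells, src, dst);   // cells touching ghost entries
    dst.compress(VectorOperation::add);         // accumulate contributions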

ping @kronbichler @Rombur @masterleinad

@kronbichler (Member)

This looks excellent from an execution point of view. (Obviously, we need to think about what to do with the existing functionality.)

Review thread on:

    bool flag = false;

    for (unsigned int i = 0; i < GeometryInfo<dim>::vertices_per_cell; i++)
      if (cell->vertex_index(i))
@peterrum (Member Author)

This is wrong ;) I have to replace it with if(ghost_vertices[cell->vertex_index (i)])!
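For reference, a sketch of the corrected check (ghost_vertices being the lookup table mentioned here):

    // A cell must wait for the ghost exchange if any of its vertices is
    // flagged as a ghost vertex.
    bool flag = false;
    for (unsigned int i = 0; i < GeometryInfo<dim>::vertices_per_cell; i++)
      if (ghost_vertices[cell->vertex_index(i)])
        flag = true;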

Member

Fixing this bug breaks step-64 :(

@Rombur (Member) commented Oct 2, 2019

I like the idea. I quickly went over the code, and my understanding is that coloring and ghost_coloring cannot be true at the same time, so we need to check that. Also, I am not sure about the name ghost_coloring, because the goal is very different from that of the other coloring.
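Such a check could look like this (a sketch; treating use_coloring and ghost_coloring as AdditionalData members is an assumption based on this discussion):

    // Hypothetical consistency check, e.g. in reinit():
    Assert(!(additional_data.use_coloring && additional_data.ghost_coloring),
           ExcMessage("Coloring and overlapping communication and "
                      "computation cannot be enabled at the same time."));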

@Rombur (Member) commented Apr 13, 2020

@peterrum I'd like to get this merged before the release. What is left to do? I think that the ghost_coloring flag can go. If you use coloring, nothing changes, and if you don't (which is what you should do on modern architectures), then we should always use the new path.

@peterrum (Member Author)

I would prefer it to be like in the CPU version:

    const bool overlap_communication_computation = true,

so that there is still the option to turn the overlap of computation and communication on and off.
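On the CUDA side this could look like the following sketch (the member name simply mirrors the CPU AdditionalData quoted above and is an assumption):

    CUDAWrappers::MatrixFree<3, double>::AdditionalData additional_data;
    // Hypothetical mirror of the CPU flag: overlap enabled by default,
    // but it can still be switched off.
    additional_data.overlap_communication_computation = true;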

Regarding the current state: I had some issues with the code I posted here (which should be the same as https://github.com/peterrum/dealii/blob/1278ba52648fc3e7949d376795133650e6aa47b2/include/deal.II/matrix_free/cuda_matrix_free.templates.h#L1086-L1116). I don't recall exactly what the problem was, but the result was wrong under certain configurations. After some debugging, I limited the size of each color (see https://github.com/peterrum/dealii/blob/85bd6e8cacd59b949920a161dfad8588faabb9c5/include/deal.II/matrix_free/cuda_matrix_free.templates.h#L1107-L1146). This worked better, but I am not 100% sure it fixed the problem. This is when my account at Piz Daint expired...

My guess is that the code is 95% ready but there is a (minor) bug somewhere. Currently, I cannot work on this PR. However, if you have the time and resources to work on it, I would be happy to help and discuss!

@Rombur (Member) commented Apr 13, 2020

> but the result was wrong under certain configurations

Do you remember what kind of configurations? Did it depend on the number of GPUs? The size of the problem?

@peterrum (Member Author)

I would say it was related to the number of refinements (for multiple GPUs - I think more than 2).

@peterrum (Member Author)

I have been using this code https://github.com/peterrum/deal-and-ceed-on-gpu/tree/fix.

@Rombur (Member) commented Apr 13, 2020

Great, thanks.


Review thread on:

    dst.compress(VectorOperation::add);
    src.zero_out_ghosts();
    src.update_ghost_values_finish();
    internal::apply_kernel_shmem<dim, Number, Functor>
Member

I think this kernel should be preceded and followed by cudaDeviceSynchronize() because of CUDA-aware MPI. How do you know that the data exchanged during the update_ghost/compress has actually been computed by the time the exchange happens? If you don't use CUDA-aware MPI, we know that everything uses the same stream, so it implicitly synchronizes, but we don't know that when using CUDA-aware MPI. This also explains the problem you saw: you have a race condition, and using more colors changes the time required to loop over the domain.
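A sketch of the fencing described here (call order simplified; with CUDA-aware MPI the exchange touches device buffers outside any stream ordering):

    cudaDeviceSynchronize();            // previously launched kernels are done
                                        // before MPI touches device buffers
    src.update_ghost_values_finish();   // CUDA-aware MPI: device-to-device
    internal::apply_kernel_shmem<dim, Number, Functor>(/* ... */);
    cudaDeviceSynchronize();            // the kernel is done before the next
    dst.compress(VectorOperation::add); // exchange reads its results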

@peterrum (Member Author)

@Rombur Thanks for the debugging!

I am wondering which version of the code you are using. Are you using this version or the one on https://github.com/peterrum/dealii/tree/dealii-on-gpu? I would think 369a5b9 is working. However, what we observed at the workshop was that the DtoH and HtoD memcpys were not overlapped, which is why we introduced the streams in 7598802. I am afraid we indeed need the synchronizations here (before and after the second block), although I had hoped that this is implicitly synchronized by the default stream.

@Rombur (Member) commented Apr 21, 2020

I found the bug. You can't see it if you use CUDA-aware MPI, but, funnily enough, you can see it in serial if you use regular MPI. In that case, we copy all the data to the host, perform the communication, and then copy all the data back to the device. This means that if we change the data on the device, it is overwritten by the data from the host.

@peterrum (Member Author)

Does this mean that the HtoD copy in update_ghost_values_finish() overwrites values that have been computed in the first block? Do I understand that correctly? But how is that possible? Only the ghost values are copied around, and that should be covered by the coloring (if it is correct), shouldn't it?

Member

No, actually the problem is here, in compress_finish: the values computed in block 2 are not taken into account.

Member

So the problem is that when you don't use CUDA-aware MPI, the function import_from_ghosted_array_finish uses the data that has been copied to the host during the call to compress_start. If you update the data on the device after the call to compress_start, the host is not updated and you lose the new values. I need to double check but I think the solution is to copy the ghost data to the device, then do the call to import_from_ghosted_array_finish, and change import_from_ghosted_array_finish so that there is one CUDA path and one host path regardless of CUDA-aware MPI.
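A rough sketch of that proposal (ghost_data_dev, ghost_data_host, and n_ghosts are placeholder names; the actual call is elided):

    // Copy only the received ghost entries back to the device instead of
    // the whole vector, so device-side updates made after compress_start()
    // are not overwritten ...
    cudaError_t err = cudaMemcpy(ghost_data_dev,
                                 ghost_data_host,
                                 n_ghosts * sizeof(Number),
                                 cudaMemcpyHostToDevice);
    AssertCuda(err);
    // ... then call import_from_ghosted_array_finish() with one CUDA path
    // and one host path, regardless of CUDA-aware MPI support.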

@peterrum (Member Author)

This is the if defined DEAL_II_COMPILER_CUDA_AWARE && !defined DEAL_II_MPI_WITH_CUDA_SUPPORT branch. In this configuration, the whole vector is copied from and to the host. But that is not the case for if defined DEAL_II_COMPILER_CUDA_AWARE && defined DEAL_II_MPI_WITH_CUDA_SUPPORT, is it?

@peterrum (Member Author) commented Apr 21, 2020

I think what I wanted to do here was not to write #ifdef DEAL_II_COMPILER_CUDA_AWARE but #ifdef DEAL_II_MPI_WITH_CUDA_SUPPORT. Sorry about this. I guess I did not notice it since I was building with CUDA-aware MPI at that time. Overlapping computation and communication only makes sense for CUDA-aware MPI...

The whole purpose of the CUDA-aware MPI implementation is precisely not to copy the whole vector.
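In other words, the device-to-device path should presumably be guarded like this (sketch):

    // A CUDA-aware build of deal.II alone is not enough; MPI itself must
    // have CUDA support for buffers to be exchanged device-to-device.
    #if defined(DEAL_II_COMPILER_CUDA_AWARE) && \
        defined(DEAL_II_MPI_WITH_CUDA_SUPPORT)
      // exchange only the ghost entries, directly between device buffers
    #else
      // fall back to staging the data through the host
    #endif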

Member

> This is the if defined DEAL_II_COMPILER_CUDA_AWARE && !defined DEAL_II_MPI_WITH_CUDA_SUPPORT branch. In this configuration, the whole vector is copied from and to the host. But that is not the case for if defined DEAL_II_COMPILER_CUDA_AWARE && defined DEAL_II_MPI_WITH_CUDA_SUPPORT, is it?

That's right.

@peterrum (Member Author)

> That's right.

I noticed your message too late...
