
Overlap communication and computation in CUDA cell_loop [WIP] #8882

Conversation

@peterrum (Member) commented Oct 1, 2019

This PR enables the overlap of communication and computation in CUDAWrappers::MatrixFree::cell_loop().

Sorry for yet another [WIP] PR, but I would like to share the changes and get feedback early. I have somewhat re-/misused the coloring scheme ;)
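The intended pattern, as a minimal sketch (run_cell_kernel, interior_cells, and ghosted_cells are hypothetical names; the actual implementation reuses the coloring infrastructure mentioned above):

    // Overlap the ghost exchange of 'src' with work on cells that do not
    // touch ghost entries; the two cell partitions are hypothetical.
    src.update_ghost_values_start();            // kick off the ghost exchange
    run_cell_kernel(interior_cells, src, dst);  // cells needing no ghost data
    src.update_ghost_values_finish();           // wait for the exchange
    run_cell_kernel(ghosted_cells, src, dst);   // cells touching ghost entries
    dst.compress(VectorOperation::add);         // accumulate contributions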

ping @kronbichler @Rombur @masterleinad

@kronbichler (Member)

This looks excellent from an execution point of view. (Obviously, we need to think about what to do with the existing functionality.)

Review thread on:

    bool flag = false;

    for (unsigned int i = 0; i < GeometryInfo<dim>::vertices_per_cell; i++)
      if (cell->vertex_index(i))
@peterrum (Member Author)

This is wrong ;) I have to replace it with if(ghost_vertices[cell->vertex_index (i)])!
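For reference, a sketch of the corrected check (ghost_vertices being the lookup table mentioned here):

    // A cell must wait for the ghost exchange if any of its vertices is
    // flagged as a ghost vertex.
    bool flag = false;
    for (unsigned int i = 0; i < GeometryInfo<dim>::vertices_per_cell; i++)
      if (ghost_vertices[cell->vertex_index(i)])
        flag = true;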

Member

Fixing this bug breaks step-64 :(

@Rombur (Member) commented Oct 2, 2019

I like the idea. I quickly went over the code, and my understanding is that coloring and ghost_coloring cannot be true at the same time, so we need to check that. Also, I am not sure about the name ghost_coloring, because the goal is very different from that of the other coloring.
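Such a check could look like this (a sketch; treating use_coloring and ghost_coloring as AdditionalData members is an assumption based on this discussion):

    // Hypothetical consistency check, e.g. in reinit():
    Assert(!(additional_data.use_coloring && additional_data.ghost_coloring),
           ExcMessage("Coloring and overlapping communication and "
                      "computation cannot be enabled at the same time."));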

@Rombur (Member) commented Apr 13, 2020

@peterrum I'd like to get this merged before the release. What is left to do? I think that the ghost_coloring flag can go. If you use coloring, nothing changes, and if you don't (which is what you should do on modern architectures), then we should always use the new path.

@peterrum (Member Author)

I would prefer it to be like in the CPU version:

    const bool overlap_communication_computation = true,

so that there is still the option to turn the overlap of computation and communication on and off.
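On the CUDA side this could look like the following sketch (the member name simply mirrors the CPU AdditionalData quoted above and is an assumption):

    CUDAWrappers::MatrixFree<3, double>::AdditionalData additional_data;
    // Hypothetical mirror of the CPU flag: overlap enabled by default,
    // but it can still be switched off.
    additional_data.overlap_communication_computation = true;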

Regarding the current state: I had some issues with the code I posted here (which should be the same as https://github.com/peterrum/dealii/blob/1278ba52648fc3e7949d376795133650e6aa47b2/include/deal.II/matrix_free/cuda_matrix_free.templates.h#L1086-L1116). I don't recall exactly what the problem was, but the result was wrong under certain configurations. After some debugging, I limited the size of each color (see https://github.com/peterrum/dealii/blob/85bd6e8cacd59b949920a161dfad8588faabb9c5/include/deal.II/matrix_free/cuda_matrix_free.templates.h#L1107-L1146). This worked better, but I am not 100% sure it fixed the problem. This is when my account at Piz Daint expired...

My guess is that the code is 95% ready but there is a (minor) bug somewhere. Currently, I cannot work on this PR. However, if you have the time and resources to work on it, I would be happy to help and discuss!

@Rombur (Member) commented Apr 13, 2020

> but the result was wrong under certain configurations

Do you remember what kind of configurations? Did it depend on the number of GPUs? The size of the problem?

@peterrum (Member Author)

I would say it was related to the number of refinements (for multiple GPUs - I think more than 2).

@peterrum (Member Author)

I have been using this code https://github.com/peterrum/deal-and-ceed-on-gpu/tree/fix.

@Rombur (Member) commented Apr 13, 2020

Great, thanks.


Review thread on:

    dst.compress(VectorOperation::add);
    src.zero_out_ghosts();
    src.update_ghost_values_finish();
    internal::apply_kernel_shmem<dim, Number, Functor>
Member

I think this kernel should be preceded and followed by cudaDeviceSynchronize() because of CUDA-aware MPI. How do you know that the data exchanged during the update_ghost/compress has actually been computed by the time the exchange happens? If you don't use CUDA-aware MPI, we know that everything uses the same stream, so it implicitly synchronizes, but we don't know that when using CUDA-aware MPI. This also explains the problem you saw: you have a race condition, and using more colors changes the time required to loop over the domain.
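A sketch of the fencing described here (call order simplified; with CUDA-aware MPI the exchange touches device buffers outside any stream ordering):

    cudaDeviceSynchronize();            // previously launched kernels are done
                                        // before MPI touches device buffers
    src.update_ghost_values_finish();   // CUDA-aware MPI: device-to-device
    internal::apply_kernel_shmem<dim, Number, Functor>(/* ... */);
    cudaDeviceSynchronize();            // the kernel is done before the next
    dst.compress(VectorOperation::add); // exchange reads its results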

@peterrum (Member Author)

@Rombur Thanks for the debugging!

I am wondering which version of the code you are using. Are you using this version or the one on https://github.com/peterrum/dealii/tree/dealii-on-gpu? I would think 369a5b9 is working. However, what we observed at the workshop was that the DtoH and HtoD memcpys were not overlapped, which is why we introduced the streams in 7598802. I am afraid we indeed need the synchronizations here (before and after the second block), although I had hoped that this is implicitly synchronized by the default stream.

@Rombur (Member) commented Apr 21, 2020

I found the bug. You can't see it if you use CUDA-aware MPI, but, funnily enough, you can see it in serial if you use regular MPI. In that case, we copy all the data to the host, perform the communication, and then copy all the data back to the device. This means that if we change the data on the device, it is overwritten by the data from the host.

@peterrum (Member Author)

Does this mean that the HtoD copy in update_ghost_values_finish() overwrites values that have been computed in the first block? Do I understand that correctly? But how is that possible? Only the ghost values are copied around, and that should be covered by the coloring (if it is correct), shouldn't it?

Member

No, actually the problem is here, in compress_finish: the values computed in block 2 are not taken into account.

Member

So the problem is that when you don't use CUDA-aware MPI, the function import_from_ghosted_array_finish uses the data that has been copied to the host during the call to compress_start. If you update the data on the device after the call to compress_start, the host is not updated and you lose the new values. I need to double check but I think the solution is to copy the ghost data to the device, then do the call to import_from_ghosted_array_finish, and change import_from_ghosted_array_finish so that there is one CUDA path and one host path regardless of CUDA-aware MPI.
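A rough sketch of that proposal (ghost_data_dev, ghost_data_host, and n_ghosts are placeholder names; the actual call is elided):

    // Copy only the received ghost entries back to the device instead of
    // the whole vector, so device-side updates made after compress_start()
    // are not overwritten ...
    cudaError_t err = cudaMemcpy(ghost_data_dev,
                                 ghost_data_host,
                                 n_ghosts * sizeof(Number),
                                 cudaMemcpyHostToDevice);
    AssertCuda(err);
    // ... then call import_from_ghosted_array_finish() with one CUDA path
    // and one host path, regardless of CUDA-aware MPI support.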

@peterrum (Member Author)

This is the if defined DEAL_II_COMPILER_CUDA_AWARE && !defined DEAL_II_MPI_WITH_CUDA_SUPPORT branch. In this configuration, the whole vector is copied from and to the host. But that is not the case for if defined DEAL_II_COMPILER_CUDA_AWARE && defined DEAL_II_MPI_WITH_CUDA_SUPPORT, is it?

@peterrum (Member Author) commented Apr 21, 2020

I think what I wanted to do here was not to write #ifdef DEAL_II_COMPILER_CUDA_AWARE but #ifdef DEAL_II_MPI_WITH_CUDA_SUPPORT. Sorry about this. I guess I did not notice it since I was building with CUDA-aware MPI at that time. Overlapping computation and communication only makes sense for CUDA-aware MPI...

The whole purpose of the CUDA-aware MPI implementation is precisely not to copy the whole vector.
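In other words, the device-to-device path should presumably be guarded like this (sketch):

    // A CUDA-aware build of deal.II alone is not enough; MPI itself must
    // have CUDA support for buffers to be exchanged device-to-device.
    #if defined(DEAL_II_COMPILER_CUDA_AWARE) && \
        defined(DEAL_II_MPI_WITH_CUDA_SUPPORT)
      // exchange only the ghost entries, directly between device buffers
    #else
      // fall back to staging the data through the host
    #endif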

Member

> This is the if defined DEAL_II_COMPILER_CUDA_AWARE && !defined DEAL_II_MPI_WITH_CUDA_SUPPORT branch. In this configuration, the whole vector is copied from and to the host. But that is not the case for if defined DEAL_II_COMPILER_CUDA_AWARE && defined DEAL_II_MPI_WITH_CUDA_SUPPORT, is it?

That's right.

@peterrum (Member Author)

> That's right.

I noticed your message too late...
