Question: Using amgcl with MPI #71
I do need to find time to provide better documentation for the MPI functionality.
You should use a library like metis or scotch to partition your system. The matrix then needs to be reordered according to the partitioning. AMGCL assumes that the system matrix has already been reordered, and that each MPI domain owns a contiguous chunk of matrix rows. The column numbers in the matrix are global (each row belongs completely to a single MPI process).
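As a rough, illustrative sketch of what this looks like in code: each rank assembles only its own contiguous block of rows (here a 1D Poisson matrix, so no external partitioner is needed), with global column indices, and hands it to the MPI solver. The solver composition and header paths below follow the pattern of amgcl's MPI examples, but treat them as an assumption and double-check against examples/mpi/mpi_amg.cpp.

```cpp
// Illustrative only: distribute a 1D Poisson matrix over MPI ranks (each rank
// owns a contiguous block of rows, column indices are global) and solve it.
// The solver composition below is one possible choice, not the only one.
#include <vector>
#include <tuple>
#include <iostream>
#include <mpi.h>

#include <amgcl/backend/builtin.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/mpi/util.hpp>
#include <amgcl/mpi/make_solver.hpp>
#include <amgcl/mpi/amg.hpp>
#include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
#include <amgcl/mpi/relaxation/spai0.hpp>
#include <amgcl/mpi/solver/bicgstab.hpp>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    amgcl::mpi::communicator comm(MPI_COMM_WORLD);

    // Global problem size and the contiguous row range owned by this rank:
    const ptrdiff_t n = 1024;
    ptrdiff_t row_beg = comm.rank * n / comm.size;
    ptrdiff_t row_end = (comm.rank + 1) * n / comm.size;
    ptrdiff_t chunk   = row_end - row_beg;

    // Local rows of the (already reordered) matrix in CSR format.
    // Column indices are GLOBAL; each row belongs to exactly one process.
    std::vector<ptrdiff_t> ptr; ptr.reserve(chunk + 1); ptr.push_back(0);
    std::vector<ptrdiff_t> col; col.reserve(3 * chunk);
    std::vector<double>    val; val.reserve(3 * chunk);

    for (ptrdiff_t i = row_beg; i < row_end; ++i) {
        if (i > 0)     { col.push_back(i - 1); val.push_back(-1.0); }
        col.push_back(i); val.push_back(2.0);
        if (i < n - 1) { col.push_back(i + 1); val.push_back(-1.0); }
        ptr.push_back(col.size());
    }

    typedef amgcl::backend::builtin<double> Backend;
    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            Backend,
            amgcl::mpi::coarsening::smoothed_aggregation<Backend>,
            amgcl::mpi::relaxation::spai0<Backend>
            >,
        amgcl::mpi::solver::bicgstab<Backend>
        > Solver;

    // Each rank passes only its own chunk; amgcl deduces the global layout.
    Solver solve(comm, std::tie(chunk, ptr, col, val));

    std::vector<double> rhs(chunk, 1.0), x(chunk, 0.0);
    int    iters;
    double error;
    std::tie(iters, error) = solve(rhs, x);

    if (comm.rank == 0)
        std::cout << "iters: " << iters << ", error: " << error << std::endl;

    MPI_Finalize();
}
```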
I am sorry, I do not understand this question.
Global column numbers are used.
Each MPI process sends its own chunk of matrix rows to AMGCL. Since AMGCL assumes the matrix has been reordered already, it can deduce the global matrix size. Not sure though if this was the question?
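For what it's worth, the global size and each rank's row offset can be recovered from the local chunk sizes alone with two reductions; the generic sketch below (not a specific amgcl call) illustrates the mechanism.

```cpp
#include <mpi.h>

// Generic sketch (not an amgcl call): recover the global number of rows and
// this rank's first global row from the local chunk sizes alone.
void global_layout(MPI_Comm comm, long long chunk,
                   long long *row_beg, long long *n_global)
{
    long long scan = 0;
    MPI_Scan(&chunk, &scan, 1, MPI_LONG_LONG, MPI_SUM, comm);         // inclusive prefix sum of chunk sizes
    *row_beg = scan - chunk;                                          // rows owned by lower ranks
    MPI_Allreduce(&chunk, n_global, 1, MPI_LONG_LONG, MPI_SUM, comm); // total number of rows
}
```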
Yes, as long as the custom matrix class returns rows according to the above points.
I see, I'll have a look at it. Thank you!
Ok, so using a partitioner seems like overkill for the simple domain I want to partition (a cube of hexahedral elements). Let's say I have divided my domain; that is, each dof of my system belongs to one sub-domain, except for the ones at the border between sub-domains, which belong to all the sub-domains that share this border. My question: you said that AMGCL assumes the matrix has already been reordered. So what else do I have to do? Can I distribute my partitions to the different MPI ranks arbitrarily? Thanks for your help so far. Even though I am asking a lot of questions, I am really impressed by how much complexity AMGCL actually hides from me, e.g. switching between OpenMP and OpenCL is trivial!
Maybe a simple example would be of help here. Let's say you have a 4x4 global matrix:
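(with purely symbolic entries a_ij, for illustration)

```
a11 a12 a13 a14
a21 a22 a23 a24
a31 a32 a33 a34
a41 a42 a43 a44
```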
If you have two MPI processes, then the matrix would be split between them row-wise, as shown below (Process 0 gets the first two rows, Process 1 the last two).
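(again with symbolic entries; note that the column indices stay global on both processes)

```
Process 0:
a11 a12 a13 a14
a21 a22 a23 a24

Process 1:
a31 a32 a33 a34
a41 a42 a43 a44
```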
Now, let's say you have a 4x4 2D grid, on which you want to solve a Poisson problem. The grid points (dofs) are numbered like this originally:
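(assuming the usual row-major ordering)

```
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
```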
If you have 4 MPI domains, you could partition the problem as shown below.
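(one natural choice is a 2x2 block decomposition, where each MPI domain owns a 2x2 patch of grid points)

```
 0  1 |  2  3
 4  5 |  6  7
------+------
 8  9 | 10 11
12 13 | 14 15
```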
and then renumber the dofs so that each block owns a continuous chunk of dofs:
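(one such renumbering, with domain 0 owning dofs 0-3, domain 1 owning 4-7, domain 2 owning 8-11, and domain 3 owning 12-15)

```
 0  1 |  4  5
 2  3 |  6  7
------+------
 8  9 | 12 13
10 11 | 14 15
```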
The parts of the finite-difference-discretized 16x16 matrix owned by each of the MPI processes (Process 0 through Process 3) may then be written in sparse notation, one block of four rows per process, as sketched below.
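A sketch of the resulting sparsity pattern, assuming the standard 5-point stencil and the renumbering above ('x' is a nonzero in a column owned by the same process, '*' a nonzero in a column owned by a remote process, '.' a zero; the actual values depend on the discretization):

```
Process 0 (rows 0-3):
x x x . | . . . . | . . . . | . . . .
x x . x | * . . . | . . . . | . . . .
x . x x | . . . . | * . . . | . . . .
. x x x | . . * . | . * . . | . . . .

Process 1 (rows 4-7):
. * . . | x x x . | . . . . | . . . .
. . . . | x x . x | . . . . | . . . .
. . . * | x . x x | . . . . | * . . .
. . . . | . x x x | . . . . | . * . .

Process 2 (rows 8-11):
. . * . | . . . . | x x x . | . . . .
. . . * | . . . . | x x . x | * . . .
. . . . | . . . . | x . x x | . . . .
. . . . | . . . . | . x x x | . . * .

Process 3 (rows 12-15):
. . . . | . . * . | . * . . | x x x .
. . . . | . . . * | . . . . | x x . x
. . . . | . . . . | . . . * | x . x x
. . . . | . . . . | . . . . | . x x x
```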
And this is exactly how you would feed this matrix to amgcl. Stars here mean that a column belongs to a remote process, but the nonzero matrix value is still stored on the process that owns the row. I hope this helps.
Wow, thank you so much! I think everything is clear now. I think this answer could go directly into the documentation; it really explains all the aspects relevant to using amgcl with MPI.
When you're updating the manual, please include a best-practice example of how to distribute a global NxN matrix, available in CSR format, to the individual MPI processes. What is the related multi-node backend concept: hybrid MPI (nodes) + OpenMP (cores/GPUs of each node), or MPI managing all cores and/or GPUs across all nodes?
Thank you for reminding me I need to finish the new documentation :). Re multi-node configuration: with the builtin (OpenMP) backend you can use either a hybrid (MPI+OpenMP) or a uniform (MPI-only, with a single OpenMP thread per process) configuration. When using GPU backends, it is normal to use one MPI process per GPU. The VexCL backend supports using multiple GPUs per MPI process, but again, I think the scalability will suffer in this case.
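With one MPI process per GPU, each rank typically pins itself to the device matching its node-local rank. A generic sketch of that idea (plain MPI + CUDA runtime, not an amgcl-specific call):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Generic sketch (not amgcl-specific): with one MPI process per GPU, each
// rank selects the device that matches its rank within the node.
void select_gpu() {
    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local);

    int local_rank, ndev;
    MPI_Comm_rank(local, &local_rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(local_rank % ndev);   // one process <-> one GPU

    MPI_Comm_free(&local);
}
```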
What do you mean by "reordering"? My starting point is the global matrix in (a multi-section) COO format (nothing to do with decomposition, it's just not stored as a whole), and I am wondering how to pass it to multiple MPI processes.
Take a look at the #71 (comment) above, specifically at the "4 MPI domains" example. Reordering or renumbering makes sure that each of your MPI domains owns a continuous set of unknown indices. There is a good chance your assembly process already satisfies this condition, so you don't have to do anything.
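If it is not already satisfied, such a renumbering can be computed directly from a partition vector. An illustrative, serial sketch (the function is hypothetical, not part of amgcl):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative sketch (not part of amgcl): given a partition vector
// (part[i] = rank that owns unknown i, e.g. produced by METIS), compute a
// permutation such that every rank owns a contiguous range of new indices.
std::vector<ptrdiff_t> renumber(const std::vector<int> &part, int nparts) {
    std::vector<ptrdiff_t> count(nparts + 1, 0);
    for (int p : part) ++count[p + 1];                            // unknowns per rank
    std::partial_sum(count.begin(), count.end(), count.begin());  // count[p] = first new index of rank p

    std::vector<ptrdiff_t> perm(part.size());
    for (size_t i = 0; i < part.size(); ++i)
        perm[i] = count[part[i]]++;                               // old index i -> new index perm[i]
    return perm;
}
```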
What if the number of rows is not a multiple of the number of processors? Then what is the local size? Thank you!
Say I have 15 rows and 4 processors; how many rows does each processor get?
Usually, that decision is left to a decomposition library like Metis or Scotch. But in general, you try to split the problem as evenly as possible, so maybe 4 + 4 + 4 + 3? Of course, when you have just 15 rows in the matrix, it does not make sense to use MPI (and when you have more, the differences between the number of rows in each domain become relatively negligible).
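For reference, here is the usual arithmetic for a near-even contiguous split (hypothetical helper functions, not amgcl API):

```cpp
#include <cstddef>

// Illustrative helpers (not amgcl API): split n rows over p processes as
// evenly as possible, giving the first (n % p) ranks one extra row.
// For n = 15, p = 4 this yields chunks of 4, 4, 4 and 3.
ptrdiff_t local_rows(ptrdiff_t n, int p, int r) {
    return n / p + (r < n % p ? 1 : 0);
}

// First global row owned by rank r, so that the chunks are contiguous.
ptrdiff_t first_row(ptrdiff_t n, int p, int r) {
    const ptrdiff_t base = n / p, rem = n % p;
    return r * base + (r < rem ? r : rem);
}
```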
Thank you for the quick response! What concerns me is that, if the problem is split into 4 + 4 + 4 + 3, for example, I have to fill the first 4 rows on processor 0, then the next 4 on processor 1, and so on. Does that mean I have to know the exact splitting, since otherwise I may place a row on the wrong processor? The same question, asked another way: how does amgcl know which rows belong to which processors? I hope this makes sense to you. Thank you so much!
You, as the amgcl user, are the one responsible for the splitting (as I said, this task is usually delegated to a specialized library, but that happens outside of amgcl in any case). In each of the MPI processes, you only send to amgcl the rows that belong to that process. amgcl does not have access to the full matrix.
Yes, thank you! I understand that amgcl is not responsible for splitting. My question is whether amgcl assumes the rows on different processors are contiguous. By that I mean: if processor 0 has 10 rows, processor 1 has 8 rows, and processor 2 has 9 rows, amgcl assumes the full matrix has 27 rows, with the first 10 on P0, the next 8 on P1, and the last 9 on P2, regardless of how many rows each processor has. Is that correct?
Yes, that is correct: amgcl assumes that the rows given to it are contiguous and, when stacked together, form the full matrix. In order to satisfy this requirement, you may have to apply the renumbering procedure described in the comment above to the original matrix.
That's cool! I was wondering how amgcl knows the row IDs. I got the renumbering part: just make sure the numbering goes from P0 through Pn-1. It may require some calls to move rows around. There are no shared rows, right? All processors should have exclusive rows.
Thank you for all your work! I have tested it with my problems; amgcl is much faster than PETSc.
amgcl/amgcl/mpi/distributed_matrix.hpp, lines 334 to 341 (at 461a66c)
This function in amgcl/examples/mpi/mpi_amg.cpp, lines 188 to 231 (at 461a66c).
Sounds good! May I ask one more question: what's the difference between solve_block and solve_scalar in mpi_amg.cpp? Currently, I am using the examples from the benchmark repo.
So I am reading through the benchmark distributed Poisson equation implementation to figure out how I could run the linear elasticity problem I have with amgcl on several MPI nodes. But I still have some questions I was hoping you might answer: