Question: Using amgcl with MPI #71
I do need to find time to provide better documentation for the MPI functionality.
You should use a library like metis or scotch to partition your system. The matrix then needs to be reordered according to the partitioning. AMGCL assumes that the system matrix has already been reordered, and that each MPI domain owns a contiguous chunk of matrix rows. The column numbers in the matrix are global (each row belongs completely to a single MPI process).
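As a rough, illustrative sketch of what this looks like in code: each rank assembles only its own contiguous block of rows (here a 1D Poisson matrix, so no external partitioner is needed), with global column indices, and hands it to the MPI solver. The solver composition and header paths below follow the pattern of amgcl's MPI examples, but treat them as an assumption and double-check against examples/mpi/mpi_amg.cpp.

```cpp
// Illustrative only: distribute a 1D Poisson matrix over MPI ranks (each rank
// owns a contiguous block of rows, column indices are global) and solve it.
// The solver composition below is one possible choice, not the only one.
#include <vector>
#include <tuple>
#include <iostream>
#include <mpi.h>

#include <amgcl/backend/builtin.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/mpi/util.hpp>
#include <amgcl/mpi/make_solver.hpp>
#include <amgcl/mpi/amg.hpp>
#include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
#include <amgcl/mpi/relaxation/spai0.hpp>
#include <amgcl/mpi/solver/bicgstab.hpp>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    amgcl::mpi::communicator comm(MPI_COMM_WORLD);

    // Global problem size and the contiguous row range owned by this rank:
    const ptrdiff_t n = 1024;
    ptrdiff_t row_beg = comm.rank * n / comm.size;
    ptrdiff_t row_end = (comm.rank + 1) * n / comm.size;
    ptrdiff_t chunk   = row_end - row_beg;

    // Local rows of the (already reordered) matrix in CSR format.
    // Column indices are GLOBAL; each row belongs to exactly one process.
    std::vector<ptrdiff_t> ptr; ptr.reserve(chunk + 1); ptr.push_back(0);
    std::vector<ptrdiff_t> col; col.reserve(3 * chunk);
    std::vector<double>    val; val.reserve(3 * chunk);

    for (ptrdiff_t i = row_beg; i < row_end; ++i) {
        if (i > 0)     { col.push_back(i - 1); val.push_back(-1.0); }
        col.push_back(i); val.push_back(2.0);
        if (i < n - 1) { col.push_back(i + 1); val.push_back(-1.0); }
        ptr.push_back(col.size());
    }

    typedef amgcl::backend::builtin<double> Backend;
    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            Backend,
            amgcl::mpi::coarsening::smoothed_aggregation<Backend>,
            amgcl::mpi::relaxation::spai0<Backend>
            >,
        amgcl::mpi::solver::bicgstab<Backend>
        > Solver;

    // Each rank passes only its own chunk; amgcl deduces the global layout.
    Solver solve(comm, std::tie(chunk, ptr, col, val));

    std::vector<double> rhs(chunk, 1.0), x(chunk, 0.0);
    int    iters;
    double error;
    std::tie(iters, error) = solve(rhs, x);

    if (comm.rank == 0)
        std::cout << "iters: " << iters << ", error: " << error << std::endl;

    MPI_Finalize();
}
```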
I am sorry, I do not understand this question.
Global column numbers are used.
Each MPI process sends its own chunk of matrix rows to AMGCL. Since AMGCL assumes the matrix has been reordered already, it can deduce the global matrix size. Not sure though if this was the question?
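For what it's worth, the global size and each rank's row offset can be recovered from the local chunk sizes alone with two reductions; the generic sketch below (not a specific amgcl call) illustrates the mechanism.

```cpp
#include <mpi.h>

// Generic sketch (not an amgcl call): recover the global number of rows and
// this rank's first global row from the local chunk sizes alone.
void global_layout(MPI_Comm comm, long long chunk,
                   long long *row_beg, long long *n_global)
{
    long long scan = 0;
    MPI_Scan(&chunk, &scan, 1, MPI_LONG_LONG, MPI_SUM, comm);         // inclusive prefix sum of chunk sizes
    *row_beg = scan - chunk;                                          // rows owned by lower ranks
    MPI_Allreduce(&chunk, n_global, 1, MPI_LONG_LONG, MPI_SUM, comm); // total number of rows
}
```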
Yes, as long as the custom matrix class returns rows according to the above points.
I see, I'll have a look at it. Thank you!
Ok, so using a partitioner seems like overkill for the simple domain I want to partition (a cube of hexahedral elements). Let's say I have divided my domain; that is, each dof of my system belongs to one sub-domain, except for the ones at the border between sub-domains, which belong to all the sub-domains that share this border. My question: you said that AMGCL assumes the matrix has already been reordered. So what else do I have to do? Can I distribute my partitions to the different MPI ranks arbitrarily? Thanks for your help so far. Even though I am asking a lot of questions, I am really impressed by how much complexity AMGCL actually hides from me, e.g. switching between OpenMP and OpenCL is trivial!
Maybe a simple example would be of help here. Let's say you have a 4x4 global matrix:
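(with purely symbolic entries a_ij, for illustration)

```
a11 a12 a13 a14
a21 a22 a23 a24
a31 a32 a33 a34
a41 a42 a43 a44
```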
If you have two MPI processes, then the matrix would be split between them row-wise, as shown below (Process 0 gets the first two rows, Process 1 the last two).
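(again with symbolic entries; note that the column indices stay global on both processes)

```
Process 0:
a11 a12 a13 a14
a21 a22 a23 a24

Process 1:
a31 a32 a33 a34
a41 a42 a43 a44
```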
Now, let's say you have a 4x4 2D grid, on which you want to solve a Poisson problem. The grid points (dofs) are numbered like this originally:
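(assuming the usual row-major ordering)

```
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
```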
If you have 4 MPI domains, you could partition the problem as shown below.
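(one natural choice is a 2x2 block decomposition, where each MPI domain owns a 2x2 patch of grid points)

```
 0  1 |  2  3
 4  5 |  6  7
------+------
 8  9 | 10 11
12 13 | 14 15
```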
and then renumber the dofs so that each block owns a continuous chunk of dofs:
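(one such renumbering, with domain 0 owning dofs 0-3, domain 1 owning 4-7, domain 2 owning 8-11, and domain 3 owning 12-15)

```
 0  1 |  4  5
 2  3 |  6  7
------+------
 8  9 | 12 13
10 11 | 14 15
```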
The parts of the finite-difference-discretized 16x16 matrix owned by each of the MPI processes (Process 0 through Process 3) may then be written in sparse notation, one block of four rows per process, as sketched below.
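A sketch of the resulting sparsity pattern, assuming the standard 5-point stencil and the renumbering above ('x' is a nonzero in a column owned by the same process, '*' a nonzero in a column owned by a remote process, '.' a zero; the actual values depend on the discretization):

```
Process 0 (rows 0-3):
x x x . | . . . . | . . . . | . . . .
x x . x | * . . . | . . . . | . . . .
x . x x | . . . . | * . . . | . . . .
. x x x | . . * . | . * . . | . . . .

Process 1 (rows 4-7):
. * . . | x x x . | . . . . | . . . .
. . . . | x x . x | . . . . | . . . .
. . . * | x . x x | . . . . | * . . .
. . . . | . x x x | . . . . | . * . .

Process 2 (rows 8-11):
. . * . | . . . . | x x x . | . . . .
. . . * | . . . . | x x . x | * . . .
. . . . | . . . . | x . x x | . . . .
. . . . | . . . . | . x x x | . . * .

Process 3 (rows 12-15):
. . . . | . . * . | . * . . | x x x .
. . . . | . . . * | . . . . | x x . x
. . . . | . . . . | . . . * | x . x x
. . . . | . . . . | . . . . | . x x x
```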
And this is exactly how you would feed this matrix to amgcl. Stars here mean that a column belongs to a remote process, but the nonzero matrix value is still stored on the process that owns the row. I hope this helps.
Wow, thank you so much! I think everything is clear now. I think this answer could go directly into the documentation; it really explains all the aspects relevant to using amgcl with MPI.
When you're updating the manual, please include a best-practice example of how to distribute a global NxN matrix, available in CSR format, to the individual MPI processes. What is the related multi-node backend concept: hybrid MPI (nodes) + OpenMP (cores/GPUs of each node), or MPI managing all cores and/or GPUs across all nodes?
Thank you for reminding me I need to finish the new documentation :). Re multi-node configuration: with the builtin (OpenMP) backend you can use either a hybrid (MPI+OpenMP) or a uniform (MPI-only, with a single OpenMP thread per process) configuration. When using GPU backends, it is normal to use one MPI process per GPU. The VexCL backend supports using multiple GPUs per MPI process, but again, I think the scalability will suffer in this case.
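With one MPI process per GPU, each rank typically pins itself to the device matching its node-local rank. A generic sketch of that idea (plain MPI + CUDA runtime, not an amgcl-specific call):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Generic sketch (not amgcl-specific): with one MPI process per GPU, each
// rank selects the device that matches its rank within the node.
void select_gpu() {
    MPI_Comm local;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local);

    int local_rank, ndev;
    MPI_Comm_rank(local, &local_rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(local_rank % ndev);   // one process <-> one GPU

    MPI_Comm_free(&local);
}
```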
What do you mean by "reordering"? My starting point is the global matrix in (a multi-section) COO format (nothing to do with decomposition, it's just not stored as a whole), and I am wondering how to pass it to multiple MPI processes.
Take a look at the #71 (comment) above, specifically at the "4 MPI domains" example. Reordering or renumbering makes sure that each of your MPI domains owns a continuous set of unknown indices. There is a good chance your assembly process already satisfies this condition, so you don't have to do anything.
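If it is not already satisfied, such a renumbering can be computed directly from a partition vector. An illustrative, serial sketch (the function is hypothetical, not part of amgcl):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative sketch (not part of amgcl): given a partition vector
// (part[i] = rank that owns unknown i, e.g. produced by METIS), compute a
// permutation such that every rank owns a contiguous range of new indices.
std::vector<ptrdiff_t> renumber(const std::vector<int> &part, int nparts) {
    std::vector<ptrdiff_t> count(nparts + 1, 0);
    for (int p : part) ++count[p + 1];                            // unknowns per rank
    std::partial_sum(count.begin(), count.end(), count.begin());  // count[p] = first new index of rank p

    std::vector<ptrdiff_t> perm(part.size());
    for (size_t i = 0; i < part.size(); ++i)
        perm[i] = count[part[i]]++;                               // old index i -> new index perm[i]
    return perm;
}
```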
What if the number of rows is not a multiple of the number of processors? Then what is the local size? Thank you!
Say I have 15 rows and 4 processors; how many rows does each processor get?
Usually, that decision is left to a decomposition library like Metis or Scotch. But in general, you try to split the problem as evenly as possible, so maybe 4 + 4 + 4 + 3? Of course, when you have just 15 rows in the matrix, it does not make sense to use MPI (and when you have more, the differences between the number of rows in each domain become relatively negligible).
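For reference, here is the usual arithmetic for a near-even contiguous split (hypothetical helper functions, not amgcl API):

```cpp
#include <cstddef>

// Illustrative helpers (not amgcl API): split n rows over p processes as
// evenly as possible, giving the first (n % p) ranks one extra row.
// For n = 15, p = 4 this yields chunks of 4, 4, 4 and 3.
ptrdiff_t local_rows(ptrdiff_t n, int p, int r) {
    return n / p + (r < n % p ? 1 : 0);
}

// First global row owned by rank r, so that the chunks are contiguous.
ptrdiff_t first_row(ptrdiff_t n, int p, int r) {
    const ptrdiff_t base = n / p, rem = n % p;
    return r * base + (r < rem ? r : rem);
}
```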
Thank you for the quick response! What concerns me is that, if the problem is split into 4 + 4 + 4 + 3, for example, I have to fill the first 4 rows on processor 0, then the next 4 on processor 1, and so on. Does that mean I have to know the exact splitting, since otherwise I may place a row on the wrong processor? The same question, asked another way: how does amgcl know which rows belong to which processors? I hope this makes sense to you. Thank you so much!
You, as the amgcl user, are the one responsible for the splitting (as I said, this task is usually delegated to a specialized library, but that happens outside of amgcl in any case). In each of the MPI processes, you only send to amgcl the rows that belong to that process. amgcl does not have access to the full matrix.
Yes, thank you! I understand that amgcl is not responsible for splitting. My question is whether amgcl assumes the rows on different processors are contiguous. By that I mean: if processor 0 has 10 rows, processor 1 has 8 rows, and processor 2 has 9 rows, amgcl assumes the full matrix has 27 rows, with the first 10 on P0, the next 8 on P1, and the last 9 on P2, regardless of how many rows each processor has. Is that correct?
Yes, that is correct: amgcl assumes that the rows given to it are contiguous and, when stacked together, form the full matrix. In order to satisfy this requirement, you may have to apply the renumbering procedure described in the comment above to the original matrix.
That's cool! I was wondering how amgcl knows the row IDs. I got the renumbering part: just make sure the numbering goes from P0 through Pn-1. It may require some calls to move rows around. There are no shared rows, right? All processors should have exclusive rows.
Thank you for all your work! I have tested it with my problems; amgcl is much faster than PETSc.
amgcl/amgcl/mpi/distributed_matrix.hpp, lines 334 to 341 (at 461a66c)
This function in amgcl/examples/mpi/mpi_amg.cpp, lines 188 to 231 (at 461a66c).
Sounds good! May I ask one more question: what's the difference between solve_block and solve_scalar in mpi_amg.cpp? Currently, I am using the examples from the benchmark repo.
So I am reading through the benchmark distributed Poisson equation implementation to figure out how I could run the linear elasticity problem I have with amgcl on several MPI nodes. But I still have some questions I was hoping you might answer: