MPI version of the Structural problem tutorial
ddemidov committed Nov 23, 2020
1 parent 9867062 commit a7bffcf
Showing 10 changed files with 360 additions and 113 deletions.
4 changes: 2 additions & 2 deletions docs/Serena.rst
@@ -362,7 +362,7 @@ contains the complete source for the solution (available at
backend version are highlighted.

.. literalinclude:: ../tutorial/2.Serena/serena_vexcl.cpp
:caption: The solution of the Serena problem with VexCL backend.
:caption: The solution of the Serena problem with the VexCL backend.
:language: cpp
:linenos:
:emphasize-lines: 4-5,28-29,32-33,81-82,98-99,106,119-120,127-128,131
@@ -413,7 +413,7 @@ The output of the program is shown below::
Iters: 162
Error: 9.74928e-09

[Serena: 27.208 s] (100.00%)
[Serena (VexCL): 27.208 s] (100.00%)
[ self: 0.180 s] ( 0.66%)
[ GPU matrix: 0.604 s] ( 2.22%)
[ read: 18.699 s] ( 68.73%)
135 changes: 133 additions & 2 deletions docs/SerenaMPI.rst
@@ -1,15 +1,146 @@
Structural problem (MPI version)
--------------------------------

In this section we look at how to use the MPI version of the AMGCL solver for
In this section we look at how to use the MPI version of the AMGCL solver with
the Serena_ system. We have already determined in the :doc:`Serena` section
that the system is best solved with the block-valued backend, and needs to be
scaled so that it has the unit diagonal.
scaled so that it has the unit diagonal. The MPI solution will be very similar
to the one we have seen in the :doc:`poisson3DbMPI` section. The only
differences are:

.. _Serena: https://sparse.tamu.edu/Janna/Serena

- The system needs to be scaled so that it has the unit diagonal. This is
complicated by the fact that the matrix product :math:`D^{-1/2} A D^{-1/2}`
has to be done in the distributed memory environment.
- The solution has to use the block-valued backend, and the partitioning needs
to take this into account. Namely, the partitioning should not split any of
the :math:`3\times3` blocks between MPI processes.
- Even though the system is symmetric, the convergence of the CG solver in the
distributed case stalls at a relative error of about :math:`10^{-6}`.
Switching to the BiCGStab solver helps with the convergence.

The next listing is the MPI version of the Serena_ system solver
(`tutorial/2.Serena/serena_mpi.cpp`_). In the following paragraphs we
highlight the differences between this version and the code in the
:doc:`poisson3DbMPI` and :doc:`Serena` sections.

.. _tutorial/2.Serena/serena_mpi.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/2.Serena/serena_mpi.cpp

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:caption: The MPI solution of the Serena problem
:language: cpp
:linenos:

We make sure that the partitioning takes the block structure of the matrix into
account by keeping the number of rows in the initial naive partitioning
divisible by 3 (here the constant ``B`` is equal to 3):

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:language: cpp
:linenos:
:lines: 46-53
:lineno-start: 46
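
A minimal sketch of this block-aware chunking (variable names are assumed to
follow the scalar version from the :doc:`poisson3DbMPI` section; the listing
above is authoritative):

.. code-block:: cpp

    // Block size of the Serena system:
    const ptrdiff_t B = 3;

    // Global number of rows in the matrix:
    size_t rows = amgcl::io::crs_size<size_t>(argv[1]);

    // Split the matrix into approximately equal strips of rows, rounding the
    // strip size up to a multiple of B, so that no 3x3 block is split
    // between two MPI processes:
    size_t chunk = (rows + world.size - 1) / world.size;
    if (chunk % B) chunk += B - chunk % B;

    size_t row_beg = std::min(rows, chunk * world.rank);
    size_t row_end = std::min(rows, row_beg + chunk);
    chunk = row_end - row_beg;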

We also create all the distributed matrices with block values, so the
partitioning is naturally block-aware. We are using the mixed-precision
approach, so the preconditioner backend is defined with single precision:

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:language: cpp
:linenos:
:lines: 65-79
:lineno-start: 65
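
For orientation, the type definitions could look roughly like the sketch below
(:math:`3\times3` static matrices as the value type, a single-precision
preconditioner backend, a double-precision solver backend, and BiCGStab as the
iterative solver); the exact code is in lines 65--79 of the tutorial source:

.. code-block:: cpp

    #include <amgcl/backend/builtin.hpp>
    #include <amgcl/value_type/static_matrix.hpp>
    #include <amgcl/mpi/make_solver.hpp>
    #include <amgcl/mpi/amg.hpp>
    #include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
    #include <amgcl/mpi/relaxation/spai0.hpp>
    #include <amgcl/mpi/solver/bicgstab.hpp>

    // 3x3 block value types: double precision for the solver backend,
    // single precision for the preconditioner backend.
    typedef amgcl::static_matrix<double, 3, 3> dmat_type;
    typedef amgcl::static_matrix<float,  3, 3> smat_type;

    typedef amgcl::backend::builtin<dmat_type> DBackend;
    typedef amgcl::backend::builtin<smat_type> FBackend;

    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            FBackend,
            amgcl::mpi::coarsening::smoothed_aggregation<FBackend>,
            amgcl::mpi::relaxation::spai0<FBackend>
            >,
        amgcl::mpi::solver::bicgstab<DBackend>
        > Solver;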

The scaling is done similarly to how we apply the reordering: first, we find
the diagonal of the local diagonal block on each of the MPI processes, and then
we create the distributed diagonal matrix with the inverted square root of the
system matrix diagonal. After that, the scaled matrix :math:`A_s = D^{-1/2} A
D^{-1/2}` is computed using the :cpp:func:`amgcl::mpi::product()` function.
The scaled RHS vector :math:`f_s = D^{-1/2} f` may in principle be computed with
the :cpp:func:`amgcl::backend::spmv()` primitive, but, since the RHS vector in
this case is simply filled with ones, the scaled RHS is just the diagonal of
:math:`D^{-1/2}`.

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:language: cpp
:linenos:
:lines: 85-125
:lineno-start: 85
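
Schematically, the scaling boils down to the steps below. This is only a
sketch: ``A`` and ``D`` are distributed matrices held in ``std::shared_ptr``,
and the exact code is in lines 85--125 of the tutorial source:

.. code-block:: cpp

    // 1. Find the diagonal of the local diagonal block of A and store the
    //    inverted square roots 1/sqrt(a_ii), one entry per local row.
    //
    // 2. Use these values to create the distributed diagonal matrix
    //    D = diag(1/sqrt(a_ii)), stored in the same block format as A.
    //
    // 3. Compute the scaled matrix A_s = D^{-1/2} A D^{-1/2} with two calls
    //    to amgcl::mpi::product():
    A = amgcl::mpi::product(*D, *amgcl::mpi::product(*A, *D));

    // 4. Since the RHS is a vector of ones, the scaled RHS is simply the
    //    diagonal of D, so no extra SpMV is required.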

Here is the output from the compiled program::

$ export OMP_NUM_THREADS=1
$ mpirun -np 4 ./serena_mpi Serena.bin
World size: 4
Matrix Serena.bin: 1391349x1391349
Partitioning[ParMETIS] 4 -> 4
Type: BiCGStab
Unknowns: 118533
Memory footprint: 18.99 M

Number of levels: 4
Operator complexity: 1.27
Grid complexity: 1.07

level unknowns nonzeros
---------------------------------
0 463783 7170189 (79.04%) [4]
1 32896 1752778 (19.32%) [4]
2 1698 144308 ( 1.59%) [4]
3 95 4833 ( 0.05%) [4]

Iterations: 80
Error: 9.34355e-09

[Serena MPI: 24.840 s] (100.00%)
[ partition: 1.159 s] ( 4.67%)
[ read: 0.265 s] ( 1.07%)
[ scale: 0.583 s] ( 2.35%)
[ setup: 0.811 s] ( 3.26%)
[ solve: 22.017 s] ( 88.64%)

The version that uses the VexCL backend should be familiar at this point. Below
is the source code (`tutorial/2.Serena/serena_mpi_vexcl.cpp`_) where the
differences from the builtin backend version are highlighted:

.. _tutorial/2.Serena/serena_mpi_vexcl.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/2.Serena/serena_mpi_vexcl.cpp

.. literalinclude:: ../tutorial/2.Serena/serena_mpi_vexcl.cpp
:caption: The MPI solution of the Serena problem using the VexCL backend
:language: cpp
:linenos:
:emphasize-lines: 4-5,40-52,84-85,100-102,144,168-169,184,193-194
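
The VexCL-specific part mostly amounts to selecting a compute device for each
MPI process and referencing the VexCL context in the backend parameters. A
rough sketch (assuming the backend typedefs use ``amgcl::backend::vexcl``; the
highlighted lines in the listing above are authoritative):

.. code-block:: cpp

    // Each MPI process grabs its own compute device (the exclusive filter
    // prevents two processes from sharing a device):
    vex::Context ctx(vex::Filter::Exclusive(
                vex::Filter::Env && vex::Filter::Count(1)));

    // Report the device selected by each process, one rank at a time:
    for (int i = 0; i < world.size; ++i) {
        if (i == world.rank)
            std::cout << world.rank << ": " << ctx.queue(0) << std::endl;
        MPI_Barrier(world);
    }

    // Reference the VexCL context in the backend parameters:
    DBackend::params bprm;
    bprm.q = ctx;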

Here is the output of the MPI version with the VexCL backend::

$ export OMP_NUM_THREADS=1
$ mpirun -np 2 ./serena_mpi_vexcl_cl Serena.bin
0: GeForce GTX 960 (NVIDIA CUDA)
1: GeForce GTX 1050 Ti (NVIDIA CUDA)
World size: 2
Matrix Serena.bin: 1391349x1391349
Partitioning[ParMETIS] 2 -> 2
Type: BiCGStab
Unknowns: 231112
Memory footprint: 37.03 M

Number of levels: 4
Operator complexity: 1.27
Grid complexity: 1.07

level unknowns nonzeros
---------------------------------
0 463783 7170189 (79.01%) [2]
1 32887 1754795 (19.34%) [2]
2 1708 146064 ( 1.61%) [2]
3 85 4059 ( 0.04%) [2]

Iterations: 83
Error: 9.80582e-09

[Serena MPI(VexCL): 10.943 s] (100.00%)
[ partition: 1.357 s] ( 12.40%)
[ read: 0.370 s] ( 3.38%)
[ scale: 0.729 s] ( 6.66%)
[ setup: 1.966 s] ( 17.97%)
[ solve: 6.512 s] ( 59.51%)
38 changes: 21 additions & 17 deletions docs/poisson3DbMPI.rst
@@ -7,7 +7,11 @@ approach. Let's solve the same problem using the Message Passing Interface
(MPI), or the distributed memory approach. We already know that using the
smoothed aggregation AMG with the simple SPAI(0) smoother is working well, so
we may start writing the code immediately. The following is the complete
MPI-based implementation of the solver. We discuss it in more details below.
MPI-based implementation of the solver
(`tutorial/1.poisson3Db/poisson3Db_mpi.cpp`_). We discuss it in more detail
below.

.. _tutorial/1.poisson3Db/poisson3Db_mpi.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/1.poisson3Db/poisson3Db_mpi.cpp

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:caption: The MPI solution of the poisson3Db problem
@@ -93,13 +97,13 @@ have as few as half the non-zeros compared to the naive partitioning of the
matrix. The solution :math:`x` in the original ordering may be obtained with
:math:`x = P y`.

In lines 37--55 we read the system matrix and the RHS vector using the naive
In lines 37--54 we read the system matrix and the RHS vector using the naive
ordering (better ordering of the unknowns will be determined later):

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 37-55
:lines: 37-54
:lineno-start: 37

First, we read the total (global) number of rows in the matrix from the binary
@@ -110,43 +114,43 @@ the RHS using :cpp:func:`amgcl::io::read_crs()` and
parameters to the functions specify the regions (in row numbers) to read. The
column indices are kept in global numbering.

In lines 63--73 we define the backend and the solver types:
In lines 62--72 we define the backend and the solver types:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 63-73
:lineno-start: 63
:lines: 62-72
:lineno-start: 62

The structure of the solver is the same as in the shared memory case in the
:doc:`poisson3Db` tutorial, but we are using the components from the
``amgcl::mpi`` namespace. Again, we are using the mixed-precision approach and
the preconditioner backend is defined with a single-precision value type.
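
A sketch of what these definitions might look like (scalar value types, a
single-precision preconditioner backend, and CG as the iterative solver; lines
62--72 of the tutorial source are authoritative):

.. code-block:: cpp

    // Double-precision backend for the solver, single-precision backend for
    // the preconditioner (mixed-precision approach):
    typedef amgcl::backend::builtin<double> DBackend;
    typedef amgcl::backend::builtin<float>  FBackend;

    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            FBackend,
            amgcl::mpi::coarsening::smoothed_aggregation<FBackend>,
            amgcl::mpi::relaxation::spai0<FBackend>
            >,
        amgcl::mpi::solver::cg<DBackend>
        > Solver;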

In lines 75--77 we create the distributed matrix from the local strips read by
In lines 74--76 we create the distributed matrix from the local strips read by
each of the MPI processes:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 75-77
:lineno-start: 75
:lines: 74-76
:lineno-start: 74
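
For reference, the construction may be sketched as follows (``DBackend`` is
the double-precision backend defined above, and ``std::tie(chunk, ptr, col,
val)`` is the local CRS strip with global column numbering):

.. code-block:: cpp

    // Wrap the local strip of the system matrix into a distributed matrix
    // (declared in <amgcl/mpi/distributed_matrix.hpp>):
    auto A = std::make_shared<amgcl::mpi::distributed_matrix<DBackend>>(
            world, std::tie(chunk, ptr, col, val));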

We could directly use the tuple of the CRS arrays ``std::tie(chunk, ptr, col,
val)`` to construct the solver (the distributed matrix would be created behind
the scenes for us), but here we need to explicitly create the matrix for a
couple of reasons. First, since we are using the mixed-precision approach, we
need the double-precision distributed matrix for the solution step. And second,
the matrix will be used to repartition the system using either ParMETIS_ or
PT-SCOTCH_ libraries in lines 79--111:
PT-SCOTCH_ libraries in lines 78--110:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 79-111
:lineno-start: 79
:lines: 78-110
:lineno-start: 78

We determine if either ParMETIS_ or PT-SCOTCH_ is available in lines 82--87,
We determine if either ParMETIS_ or PT-SCOTCH_ is available in lines 81--86,
and use the corresponding wrapper provided by AMGCL. The wrapper computes
the permutation matrix :math:`P`, which is used to reorder both the system
matrix and the RHS vector. Since the reordering may change the number of rows
@@ -156,8 +160,8 @@ owned by each MPI process, we update the number of local rows stored in the
.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 113-129
:lineno-start: 113
:lines: 112-128
:lineno-start: 112
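
The essence of these final steps may be sketched as follows (``Solver`` is the
type defined above, ``A`` is the double-precision distributed matrix, and
``rhs`` is the local chunk of the right-hand side):

.. code-block:: cpp

    // Construct the solver; this builds the AMG hierarchy:
    Solver::params prm;
    Solver solve(world, A, prm);

    // Zero initial approximation:
    std::vector<double> x(chunk, 0.0);

    // Solve the system. The double-precision matrix A is passed explicitly,
    // since the preconditioner only stores its single-precision copy:
    size_t iters;
    double error;
    std::tie(iters, error) = solve(*A, rhs, x);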

At this point we are ready to initialize the solver (line 115), and solve the
system (line 128). Here is the output of the compiled program. Note that the
@@ -200,15 +204,15 @@ not support the mixed-precision approach, we will use the VexCL_ backend, which
allows us to employ CUDA, OpenCL, or OpenMP compute devices. The source code
(`tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp`_) is very similar to the
version using the builtin backend and is shown below with the differences
highligted.
highlighted.

.. _VexCL: https://github.com/ddemidov/vexcl
.. _tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp
:language: cpp
:linenos:
:emphasize-lines: 4,36-42,68,77-78,114,127-129,142-143
:emphasize-lines: 4,36-42,67,76-77,113,126-128,131,141-142

Basically, we replace the ``builtin`` backend with the ``vexcl`` one,
initialize the VexCL context and reference the context in the backend
1 change: 1 addition & 0 deletions docs/tutorial.rst
@@ -15,4 +15,5 @@ GPU backends were obtained with the NVIDIA GeForce GTX 1050 Ti GPU.
poisson3Db
poisson3DbMPI
Serena
SerenaMPI
CoupCons3D
3 changes: 1 addition & 2 deletions tutorial/1.poisson3Db/poisson3Db_mpi.cpp
@@ -40,8 +40,7 @@ int main(int argc, char *argv[]) {
size_t rows = amgcl::io::crs_size<size_t>(argv[1]);
size_t cols;

// Split the matrix into approximately equal chunks of rows, so that each
// chunk size is still divisible by the block size.
// Split the matrix into approximately equal chunks of rows
size_t chunk = (rows + world.size - 1) / world.size;
size_t row_beg = std::min(rows, chunk * world.rank);
size_t row_end = std::min(rows, row_beg + chunk);
3 changes: 1 addition & 2 deletions tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp
@@ -50,8 +50,7 @@ int main(int argc, char *argv[]) {
size_t rows = amgcl::io::crs_size<size_t>(argv[1]);
size_t cols;

// Split the matrix into approximately equal chunks of rows, so that each
// chunk size is still divisible by the block size.
// Split the matrix into approximately equal chunks of rows
size_t chunk = (rows + world.size - 1) / world.size;
size_t row_beg = std::min(rows, chunk * world.rank);
size_t row_end = std::min(rows, row_beg + chunk);
22 changes: 20 additions & 2 deletions tutorial/2.Serena/CMakeLists.txt
@@ -10,6 +10,24 @@ if (TARGET mpi_target)
add_executable(serena_mpi serena_mpi.cpp)
target_link_libraries(serena_mpi amgcl mpi_target)

add_executable(serena_mpi_mixed serena_mpi_mixed.cpp)
target_link_libraries(serena_mpi_mixed amgcl mpi_target)
if (TARGET scotch_target)
target_link_libraries(serena_mpi scotch_target)
endif()

if (TARGET Metis::metis)
target_link_libraries(serena_mpi Metis::metis)
endif()

if (VexCL_FOUND)
vexcl_add_executables(serena_mpi_vexcl serena_mpi_vexcl.cpp)
target_link_libraries(serena_mpi_vexcl INTERFACE amgcl mpi_target)

if (TARGET scotch_target)
target_link_libraries(serena_mpi_vexcl INTERFACE scotch_target)
endif()

if (TARGET Metis::metis)
target_link_libraries(serena_mpi_vexcl INTERFACE Metis::metis)
endif()
endif()
endif()
