MPI version of the Structural problem tutorial
ddemidov committed Nov 23, 2020
1 parent 9867062 commit a7bffcf
Showing 10 changed files with 360 additions and 113 deletions.
4 changes: 2 additions & 2 deletions docs/Serena.rst
@@ -362,7 +362,7 @@ contains the complete source for the solution (available at
backend version are highlighted.

.. literalinclude:: ../tutorial/2.Serena/serena_vexcl.cpp
:caption: The solution of the Serena problem with VexCL backend.
:caption: The solution of the Serena problem with the VexCL backend.
:language: cpp
:linenos:
:emphasize-lines: 4-5,28-29,32-33,81-82,98-99,106,119-120,127-128,131
@@ -413,7 +413,7 @@ The output of the program is shown below::
Iters: 162
Error: 9.74928e-09

[Serena: 27.208 s] (100.00%)
[Serena (VexCL): 27.208 s] (100.00%)
[ self: 0.180 s] ( 0.66%)
[ GPU matrix: 0.604 s] ( 2.22%)
[ read: 18.699 s] ( 68.73%)
135 changes: 133 additions & 2 deletions docs/SerenaMPI.rst
@@ -1,15 +1,146 @@
Structural problem (MPI version)
--------------------------------

In this section we look at how to use the MPI version of the AMGCL solver for
In this section we look at how to use the MPI version of the AMGCL solver with
the Serena_ system. We have already determined in the :doc:`Serena` section
that the system is best solved with the block-valued backend, and needs to be
scaled so that it has the unit diagonal.
scaled so that it has the unit diagonal. The MPI solution will be very similar
to the one we have seen in the :doc:`poisson3DbMPI` section. The only
differences are:

.. _Serena: https://sparse.tamu.edu/Janna/Serena

- The system needs to be scaled so that it has the unit diagonal. This is
complicated by the fact that the matrix product :math:`D^{-1/2} A D^{-1/2}`
has to be done in the distributed memory environment.
- The solution has to use the block-valued backend, and the partitioning needs
to take this into account. Namely, the partitioning should not split any of
the :math:`3\times3` blocks between MPI processes.
- Even though the system is symmetric, the convergence of the CG solver in the
distributed case stalls at a relative error of about :math:`10^{-6}`.
Switching to the BiCGStab solver helps with the convergence.

The next listing is the MPI version of the Serena_ system solver
(`tutorial/2.Serena/serena_mpi.cpp`_). In the following paragraphs we
highlight the differences between this version and the code in the
:doc:`poisson3DbMPI` and :doc:`Serena` sections.

.. _tutorial/2.Serena/serena_mpi.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/2.Serena/serena_mpi.cpp

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:caption: The MPI solution of the Serena problem
:language: cpp
:linenos:

We make sure that the partitioning takes the block structure of the matrix into
account by keeping the number of rows in the initial naive partitioning
divisible by 3 (here the constant ``B`` is equal to 3):

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:language: cpp
:linenos:
:lines: 46-53
:lineno-start: 46
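
A minimal sketch of this block-aware chunking (variable names are assumed to
follow the scalar version from the :doc:`poisson3DbMPI` section; the listing
above is authoritative):

.. code-block:: cpp

    // Block size of the Serena system:
    const ptrdiff_t B = 3;

    // Global number of rows in the matrix:
    size_t rows = amgcl::io::crs_size<size_t>(argv[1]);

    // Split the matrix into approximately equal strips of rows, rounding the
    // strip size up to a multiple of B, so that no 3x3 block is split
    // between two MPI processes:
    size_t chunk = (rows + world.size - 1) / world.size;
    if (chunk % B) chunk += B - chunk % B;

    size_t row_beg = std::min(rows, chunk * world.rank);
    size_t row_end = std::min(rows, row_beg + chunk);
    chunk = row_end - row_beg;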

We also create all the distributed matrices with block values, so the
partitioning is naturally block-aware. We are using the mixed-precision
approach, so the preconditioner backend is defined with single precision:

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:language: cpp
:linenos:
:lines: 65-79
:lineno-start: 65
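
For orientation, the type definitions could look roughly like the sketch below
(:math:`3\times3` static matrices as the value type, a single-precision
preconditioner backend, a double-precision solver backend, and BiCGStab as the
iterative solver); the exact code is in lines 65--79 of the tutorial source:

.. code-block:: cpp

    #include <amgcl/backend/builtin.hpp>
    #include <amgcl/value_type/static_matrix.hpp>
    #include <amgcl/mpi/make_solver.hpp>
    #include <amgcl/mpi/amg.hpp>
    #include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
    #include <amgcl/mpi/relaxation/spai0.hpp>
    #include <amgcl/mpi/solver/bicgstab.hpp>

    // 3x3 block value types: double precision for the solver backend,
    // single precision for the preconditioner backend.
    typedef amgcl::static_matrix<double, 3, 3> dmat_type;
    typedef amgcl::static_matrix<float,  3, 3> smat_type;

    typedef amgcl::backend::builtin<dmat_type> DBackend;
    typedef amgcl::backend::builtin<smat_type> FBackend;

    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            FBackend,
            amgcl::mpi::coarsening::smoothed_aggregation<FBackend>,
            amgcl::mpi::relaxation::spai0<FBackend>
            >,
        amgcl::mpi::solver::bicgstab<DBackend>
        > Solver;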

The scaling is done similarly to how we apply the reordering: first, we find
the diagonal of the local diagonal block on each of the MPI processes, and then
we create the distributed diagonal matrix with the inverted square root of the
system matrix diagonal. After that, the scaled matrix :math:`A_s = D^{-1/2} A
D^{-1/2}` is computed using the :cpp:func:`amgcl::mpi::product()` function.
The scaled RHS vector :math:`f_s = D^{-1/2} f` may in principle be computed with
the :cpp:func:`amgcl::backend::spmv()` primitive, but, since the RHS vector in
this case is simply filled with ones, the scaled RHS is just the diagonal of
:math:`D^{-1/2}`.

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:language: cpp
:linenos:
:lines: 85-125
:lineno-start: 85
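
Schematically, the scaling boils down to the steps below. This is only a
sketch: ``A`` and ``D`` are distributed matrices held in ``std::shared_ptr``,
and the exact code is in lines 85--125 of the tutorial source:

.. code-block:: cpp

    // 1. Find the diagonal of the local diagonal block of A and store the
    //    inverted square roots 1/sqrt(a_ii), one entry per local row.
    //
    // 2. Use these values to create the distributed diagonal matrix
    //    D = diag(1/sqrt(a_ii)), stored in the same block format as A.
    //
    // 3. Compute the scaled matrix A_s = D^{-1/2} A D^{-1/2} with two calls
    //    to amgcl::mpi::product():
    A = amgcl::mpi::product(*D, *amgcl::mpi::product(*A, *D));

    // 4. Since the RHS is a vector of ones, the scaled RHS is simply the
    //    diagonal of D, so no extra SpMV is required.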

Here is the output from the compiled program::

$ export OMP_NUM_THREADS=1
$ mpirun -np 4 ./serena_mpi Serena.bin
World size: 4
Matrix Serena.bin: 1391349x1391349
Partitioning[ParMETIS] 4 -> 4
Type: BiCGStab
Unknowns: 118533
Memory footprint: 18.99 M

Number of levels: 4
Operator complexity: 1.27
Grid complexity: 1.07

level unknowns nonzeros
---------------------------------
0 463783 7170189 (79.04%) [4]
1 32896 1752778 (19.32%) [4]
2 1698 144308 ( 1.59%) [4]
3 95 4833 ( 0.05%) [4]

Iterations: 80
Error: 9.34355e-09

[Serena MPI: 24.840 s] (100.00%)
[ partition: 1.159 s] ( 4.67%)
[ read: 0.265 s] ( 1.07%)
[ scale: 0.583 s] ( 2.35%)
[ setup: 0.811 s] ( 3.26%)
[ solve: 22.017 s] ( 88.64%)

The version that uses the VexCL backend should be familiar at this point. Below
is the source code (`tutorial/2.Serena/serena_mpi_vexcl.cpp`_) where the
differences from the builtin backend version are highlighted:

.. _tutorial/2.Serena/serena_mpi_vexcl.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/2.Serena/serena_mpi_vexcl.cpp

.. literalinclude:: ../tutorial/2.Serena/serena_mpi_vexcl.cpp
:caption: The MPI solution of the Serena problem using the VexCL backend
:language: cpp
:linenos:
:emphasize-lines: 4-5,40-52,84-85,100-102,144,168-169,184,193-194
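
The VexCL-specific part mostly amounts to selecting a compute device for each
MPI process and referencing the VexCL context in the backend parameters. A
rough sketch (assuming the backend typedefs use ``amgcl::backend::vexcl``; the
highlighted lines in the listing above are authoritative):

.. code-block:: cpp

    // Each MPI process grabs its own compute device (the exclusive filter
    // prevents two processes from sharing a device):
    vex::Context ctx(vex::Filter::Exclusive(
                vex::Filter::Env && vex::Filter::Count(1)));

    // Report the device selected by each process, one rank at a time:
    for (int i = 0; i < world.size; ++i) {
        if (i == world.rank)
            std::cout << world.rank << ": " << ctx.queue(0) << std::endl;
        MPI_Barrier(world);
    }

    // Reference the VexCL context in the backend parameters:
    DBackend::params bprm;
    bprm.q = ctx;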

Here is the output of the MPI version with the VexCL backend::

$ export OMP_NUM_THREADS=1
$ mpirun -np 2 ./serena_mpi_vexcl_cl Serena.bin
0: GeForce GTX 960 (NVIDIA CUDA)
1: GeForce GTX 1050 Ti (NVIDIA CUDA)
World size: 2
Matrix Serena.bin: 1391349x1391349
Partitioning[ParMETIS] 2 -> 2
Type: BiCGStab
Unknowns: 231112
Memory footprint: 37.03 M

Number of levels: 4
Operator complexity: 1.27
Grid complexity: 1.07

level unknowns nonzeros
---------------------------------
0 463783 7170189 (79.01%) [2]
1 32887 1754795 (19.34%) [2]
2 1708 146064 ( 1.61%) [2]
3 85 4059 ( 0.04%) [2]

Iterations: 83
Error: 9.80582e-09

[Serena MPI(VexCL): 10.943 s] (100.00%)
[ partition: 1.357 s] ( 12.40%)
[ read: 0.370 s] ( 3.38%)
[ scale: 0.729 s] ( 6.66%)
[ setup: 1.966 s] ( 17.97%)
[ solve: 6.512 s] ( 59.51%)
38 changes: 21 additions & 17 deletions docs/poisson3DbMPI.rst
@@ -7,7 +7,11 @@ approach. Let's solve the same problem using the Message Passing Interface
(MPI), or the distributed memory approach. We already know that using the
smoothed aggregation AMG with the simple SPAI(0) smoother is working well, so
we may start writing the code immediately. The following is the complete
MPI-based implementation of the solver. We discuss it in more details below.
MPI-based implementation of the solver
(`tutorial/1.poisson3Db/poisson3Db_mpi.cpp`_). We discuss it in more detail
below.

.. _tutorial/1.poisson3Db/poisson3Db_mpi.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/1.poisson3Db/poisson3Db_mpi.cpp

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:caption: The MPI solution of the poisson3Db problem
@@ -93,13 +97,13 @@ have as few as half the non-zeros compared to the naive partitioning of the
matrix. The solution :math:`x` in the original ordering may be obtained with
:math:`x = P y`.

In lines 37--55 we read the system matrix and the RHS vector using the naive
In lines 37--54 we read the system matrix and the RHS vector using the naive
ordering (better ordering of the unknowns will be determined later):

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 37-55
:lines: 37-54
:lineno-start: 37

First, we read the total (global) number of rows in the matrix from the binary
@@ -110,43 +114,43 @@ the RHS using :cpp:func:`amgcl::io::read_crs()` and
parameters to the functions specify the regions (in row numbers) to read. The
column indices are kept in global numbering.

In lines 63--73 we define the backend and the solver types:
In lines 62--72 we define the backend and the solver types:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 63-73
:lineno-start: 63
:lines: 62-72
:lineno-start: 62

The structure of the solver is the same as in the shared memory case in the
:doc:`poisson3Db` tutorial, but we are using the components from the
``amgcl::mpi`` namespace. Again, we are using the mixed-precision approach and
the preconditioner backend is defined with a single-precision value type.
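
A sketch of what these definitions might look like (scalar value types, a
single-precision preconditioner backend, and CG as the iterative solver; lines
62--72 of the tutorial source are authoritative):

.. code-block:: cpp

    // Double-precision backend for the solver, single-precision backend for
    // the preconditioner (mixed-precision approach):
    typedef amgcl::backend::builtin<double> DBackend;
    typedef amgcl::backend::builtin<float>  FBackend;

    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            FBackend,
            amgcl::mpi::coarsening::smoothed_aggregation<FBackend>,
            amgcl::mpi::relaxation::spai0<FBackend>
            >,
        amgcl::mpi::solver::cg<DBackend>
        > Solver;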

In lines 75--77 we create the distributed matrix from the local strips read by
In lines 74--76 we create the distributed matrix from the local strips read by
each of the MPI processes:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 75-77
:lineno-start: 75
:lines: 74-76
:lineno-start: 74
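
For reference, the construction may be sketched as follows (``DBackend`` is
the double-precision backend defined above, and ``std::tie(chunk, ptr, col,
val)`` is the local CRS strip with global column numbering):

.. code-block:: cpp

    // Wrap the local strip of the system matrix into a distributed matrix
    // (declared in <amgcl/mpi/distributed_matrix.hpp>):
    auto A = std::make_shared<amgcl::mpi::distributed_matrix<DBackend>>(
            world, std::tie(chunk, ptr, col, val));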

We could directly use the tuple of the CRS arrays ``std::tie(chunk, ptr, col,
val)`` to construct the solver (the distributed matrix would be created behind
the scenes for us), but here we need to explicitly create the matrix for a
couple of reasons. First, since we are using the mixed-precision approach, we
need the double-precision distributed matrix for the solution step. And second,
the matrix will be used to repartition the system using either ParMETIS_ or
PT-SCOTCH_ libraries in lines 79--111:
PT-SCOTCH_ libraries in lines 78--110:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 79-111
:lineno-start: 79
:lines: 78-110
:lineno-start: 78

We determine if either ParMETIS_ or PT-SCOTCH_ is available in lines 82--87,
We determine if either ParMETIS_ or PT-SCOTCH_ is available in lines 81--86,
and use the corresponding wrapper provided by AMGCL. The wrapper computes
the permutation matrix :math:`P`, which is used to reorder both the system
matrix and the RHS vector. Since the reordering may change the number of rows
@@ -156,8 +160,8 @@ owned by each MPI process, we update the number of local rows stored in the
.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 113-129
:lineno-start: 113
:lines: 112-128
:lineno-start: 112
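
The essence of these final steps may be sketched as follows (``Solver`` is the
type defined above, ``A`` is the double-precision distributed matrix, and
``rhs`` is the local chunk of the right-hand side):

.. code-block:: cpp

    // Construct the solver; this builds the AMG hierarchy:
    Solver::params prm;
    Solver solve(world, A, prm);

    // Zero initial approximation:
    std::vector<double> x(chunk, 0.0);

    // Solve the system. The double-precision matrix A is passed explicitly,
    // since the preconditioner only stores its single-precision copy:
    size_t iters;
    double error;
    std::tie(iters, error) = solve(*A, rhs, x);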

At this point we are ready to initialize the solver (line 115), and solve the
system (line 128). Here is the output of the compiled program. Note that the
@@ -200,15 +204,15 @@ not support the mixed-precision approach, we will use the VexCL_ backend, which
allows us to employ CUDA, OpenCL, or OpenMP compute devices. The source code
(`tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp`_) is very similar to the
version using the builtin backend and is shown below with the differences
highligted.
highlighted.

.. _VexCL: https://github.com/ddemidov/vexcl
.. _tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp
:language: cpp
:linenos:
:emphasize-lines: 4,36-42,68,77-78,114,127-129,142-143
:emphasize-lines: 4,36-42,67,76-77,113,126-128,131,141-142

Basically, we replace the ``builtin`` backend with the ``vexcl`` one,
initialize the VexCL context and reference the context in the backend
1 change: 1 addition & 0 deletions docs/tutorial.rst
@@ -15,4 +15,5 @@ GPU backends were obtained with the NVIDIA GeForce GTX 1050 Ti GPU.
poisson3Db
poisson3DbMPI
Serena
SerenaMPI
CoupCons3D
3 changes: 1 addition & 2 deletions tutorial/1.poisson3Db/poisson3Db_mpi.cpp
@@ -40,8 +40,7 @@ int main(int argc, char *argv[]) {
size_t rows = amgcl::io::crs_size<size_t>(argv[1]);
size_t cols;

// Split the matrix into approximately equal chunks of rows, so that each
// chunk size is still divisible by the block size.
// Split the matrix into approximately equal chunks of rows
size_t chunk = (rows + world.size - 1) / world.size;
size_t row_beg = std::min(rows, chunk * world.rank);
size_t row_end = std::min(rows, row_beg + chunk);
3 changes: 1 addition & 2 deletions tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp
@@ -50,8 +50,7 @@ int main(int argc, char *argv[]) {
size_t rows = amgcl::io::crs_size<size_t>(argv[1]);
size_t cols;

// Split the matrix into approximately equal chunks of rows, so that each
// chunk size is still divisible by the block size.
// Split the matrix into approximately equal chunks of rows
size_t chunk = (rows + world.size - 1) / world.size;
size_t row_beg = std::min(rows, chunk * world.rank);
size_t row_end = std::min(rows, row_beg + chunk);
22 changes: 20 additions & 2 deletions tutorial/2.Serena/CMakeLists.txt
@@ -10,6 +10,24 @@ if (TARGET mpi_target)
add_executable(serena_mpi serena_mpi.cpp)
target_link_libraries(serena_mpi amgcl mpi_target)

add_executable(serena_mpi_mixed serena_mpi_mixed.cpp)
target_link_libraries(serena_mpi_mixed amgcl mpi_target)
if (TARGET scotch_target)
target_link_libraries(serena_mpi scotch_target)
endif()

if (TARGET Metis::metis)
target_link_libraries(serena_mpi Metis::metis)
endif()

if (VexCL_FOUND)
vexcl_add_executables(serena_mpi_vexcl serena_mpi_vexcl.cpp)
target_link_libraries(serena_mpi_vexcl INTERFACE amgcl mpi_target)

if (TARGET scotch_target)
target_link_libraries(serena_mpi_vexcl INTERFACE scotch_target)
endif()

if (TARGET Metis::metis)
target_link_libraries(serena_mpi_vexcl INTERFACE Metis::metis)
endif()
endif()
endif()
