Poisson with MPI tutorial
ddemidov committed Nov 23, 2020
1 parent ebcf3e5 commit 4db1ffa
Showing 18 changed files with 1,032 additions and 13 deletions.
4 changes: 2 additions & 2 deletions docs/CoupCons3D.rst
@@ -1,5 +1,5 @@
3D fully coupled poroelastic problem
====================================
Fully coupled poroelastic problem
---------------------------------

This system may be downloaded from the CoupCons3D_ page (use the `Matrix
Market`_ download option). According to the description, the system has been
6 changes: 3 additions & 3 deletions docs/Serena.rst
@@ -1,8 +1,8 @@
3D structural problem
---------------------
Structural problem
------------------

This system may be downloaded from the Serena_ page (use the `Matrix Market`_
download option). According to the description, the system represents a gas
download option). According to the description, the system represents a 3D gas
reservoir simulation for CO2 sequestration, and was obtained from a structural
problem discretizing a gas reservoir with tetrahedral Finite Elements. The
medium is strongly heterogeneous and characterized by a complex geometry
15 changes: 15 additions & 0 deletions docs/SerenaMPI.rst
@@ -0,0 +1,15 @@
Structural problem (MPI version)
--------------------------------

In this section we look at how to use the MPI version of the AMGCL solver for
the Serena_ system. We have already determined in the :doc:`Serena` section
that the system is best solved with a block-valued backend, and that it needs
to be scaled so that it has a unit diagonal.

.. _Serena: https://sparse.tamu.edu/Janna/Serena
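
In AMGCL a block-valued backend simply means using a statically-sized matrix
as the backend value type. The following minimal sketch assumes, for
illustration, :math:`3\times3` blocks (three unknowns per grid node); the
actual types used are given in the listing below:

.. code-block:: cpp

    #include <amgcl/backend/builtin.hpp>
    #include <amgcl/value_type/static_matrix.hpp>

    // Block value type: a small dense matrix used as the matrix "scalar".
    // 3x3 blocks are assumed here for illustration only.
    typedef amgcl::static_matrix<double, 3, 3> block_type;
    typedef amgcl::backend::builtin<block_type> Backend;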

.. literalinclude:: ../tutorial/2.Serena/serena_mpi.cpp
:caption: The MPI solution of the Serena problem
:language: cpp
:linenos:

1 change: 1 addition & 0 deletions docs/conf.py
@@ -57,6 +57,7 @@
copyright = '2012-2020, Denis Demidov <dennis.demidov@gmail.com>'
author = 'Denis Demidov'
numfig = True
plot_include_source = False

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
2 changes: 1 addition & 1 deletion docs/examples.rst
@@ -166,7 +166,7 @@ The solution to the problem looks like this:
for k in range(1,20,2):
u -= 16/pi**3 * (sin(k*pi*(1+x)/2) / (k**3 * sinh(k * pi))
* (sinh(k * pi * (1 + y) / 2) + sinh(k * pi * (1 - y)/2)))
figure(figsize=(3,3))
figure(figsize=(5,5))
imshow(u, extent=(0,1,0,1))
show()

13 changes: 7 additions & 6 deletions docs/poisson3Db.rst
@@ -1,5 +1,5 @@
3D Poisson problem
------------------
Poisson problem
---------------

This system may be downloaded from the poisson3Db_ page (use the `Matrix
Market`_ download option). The system matrix has 85,623 rows and 2,374,949
@@ -296,10 +296,11 @@ solution using an NVIDIA GPU. We only need to use the
:cpp:class:`amgcl::backend::cuda` instead of the ``builtin`` backend, and we
also need to initialize the CUSPARSE library and pass the handle to AMGCL as
the backend parameters. Unfortunately, we can not use the mixed precision
approach, as CUSPARSE does not support that (we could use a VexCL_ backend
though). The source code is very close to what we have seen above and is
available at `tutorial/1.poisson3Db/poisson3Db_cuda.cu`_. The listing below has
the differences highlighted:
approach, as CUSPARSE does not support that (we could use the VexCL_ backend
though, see :doc:`poisson3DbMPI`). The source code is very close to what we
have seen above and is available at
`tutorial/1.poisson3Db/poisson3Db_cuda.cu`_. The listing below has the
differences highlighted:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_cuda.cu
:caption: The source code for the solution of the poisson3Db problem using the CUDA backend.
254 changes: 254 additions & 0 deletions docs/poisson3DbMPI.rst
@@ -0,0 +1,254 @@
Poisson problem (MPI version)
-----------------------------

In section :doc:`poisson3Db` we looked at the solution of the 3D Poisson
problem (available for download from the poisson3Db_ page) using the shared
memory approach. Let us now solve the same problem using the Message Passing
Interface (MPI), that is, the distributed memory approach. We already know
that the smoothed aggregation AMG with the simple SPAI(0) smoother works well
for this problem, so we may start writing the code immediately. The following
is the complete MPI-based implementation of the solver; we discuss it in more
detail below.

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:caption: The MPI solution of the poisson3Db problem
:language: cpp
:linenos:

In lines 4--21 we include the required components. Here we are using the
builtin (OpenMP-based) backend and the CRS tuple adapter. Next we include the
MPI-specific headers that provide the distributed-memory implementation of the
AMGCL algorithms. This time we are reading the system matrix and the RHS
vector in the binary format, and so we include the ``<amgcl/io/binary.hpp>``
header instead of the usual ``<amgcl/io/mm.hpp>``. The binary format is not
only faster to read, but it also allows reading the matrix and the RHS vector
in chunks, which is exactly what we need for the distributed approach.
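
The MPI-specific headers roughly correspond to the components used further
below; the following is a condensed sketch (the authoritative list is in lines
4--21 of the listing above):

.. code-block:: cpp

    // Distributed-memory AMGCL components: distributed matrix, solver wrapper,
    // AMG hierarchy, coarsening and relaxation schemes, iterative solver.
    #include <amgcl/mpi/distributed_matrix.hpp>
    #include <amgcl/mpi/make_solver.hpp>
    #include <amgcl/mpi/amg.hpp>
    #include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
    #include <amgcl/mpi/relaxation/spai0.hpp>
    #include <amgcl/mpi/solver/bicgstab.hpp>

    // Binary reader for the matrix and the RHS vector.
    #include <amgcl/io/binary.hpp>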

After checking the validity of the command line parameters, we initialize the
MPI context and communicator in lines 31--32:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 31-32
:lineno-start: 31

The :cpp:class:`amgcl::mpi::init` is a convenience RAII wrapper for
:cpp:func:`MPI_Init()`. It will call :cpp:func:`MPI_Finalize()` in its
destructor when the instance (``mpi``) goes out of scope at the end of the
program. We don't have to use the wrapper, but it simply makes things easier.
:cpp:class:`amgcl::mpi::communicator` is an equally thin wrapper for
``MPI_Comm``. :cpp:class:`amgcl::mpi::communicator` and ``MPI_Comm`` may be
used interchangeably both with the AMGCL MPI interface and the native MPI
functions.
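
To illustrate the interchangeability, here is a small standalone sketch (not
part of the tutorial source; the header path is an assumption):

.. code-block:: cpp

    #include <iostream>
    #include <mpi.h>
    #include <amgcl/mpi/util.hpp>  // assumed location of init/communicator

    int main(int argc, char *argv[]) {
        // RAII wrapper: MPI_Init() here, MPI_Finalize() when `mpi` goes out of scope.
        amgcl::mpi::init mpi(&argc, &argv);

        // Thin wrapper around MPI_Comm, exposing the rank and size of the communicator.
        amgcl::mpi::communicator world(MPI_COMM_WORLD);
        if (world.rank == 0)
            std::cout << "running on " << world.size << " processes" << std::endl;

        // The wrapper converts back to MPI_Comm, so native MPI calls accept it directly.
        MPI_Barrier(world);
    }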

The system has to be divided (partitioned) between multiple MPI processes. The
simplest way to do this is shown in the following figure:

.. _poisson3Db: https://sparse.tamu.edu/FEMLAB/poisson3Db

.. figure:: ../tutorial/1.poisson3Db/Poisson3D_mpi.png
:width: 90%

Poisson3Db matrix partitioned between the 4 MPI processes

Assuming we are using 4 MPI processes, the matrix is split into 4 contiguous
chunks of rows, so that each MPI process owns approximately 25% of the matrix.
This works well enough for a small number of processes, but as the size of the
compute cluster grows, the simple partitioning becomes less and less efficient.
Creating an efficient partitioning is outside of AMGCL's scope, but AMGCL does
provide wrappers for partitioning libraries such as ParMETIS_ and PT-SCOTCH_.
The difference between the naive and the optimal partitioning is demonstrated
in the next figure:

.. _ParMETIS: http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview
.. _PT-SCOTCH: https://www.labri.fr/perso/pelegrin/scotch/

.. plot:: ../tutorial/1.poisson3Db/partition.py

Naive vs optimal partitioning of a :math:`4\times4` grid between 4 MPI processes.

The figure shows the finite-difference discretization of a 2D Poisson problem
on a :math:`4\times4` grid in a unit square. The nonzero pattern of the system
matrix is shown in the lower left plot. If the grid nodes are numbered
row-wise, then the naive partitioning of the system matrix between the 4 MPI
processes is shown in the upper left plot. The domains belonging to each of
the MPI processes correspond to contiguous sets of grid node indices and are
elongated along the X axis. This results in high MPI communication traffic,
since the number of interface nodes is high relative to the number of interior
nodes. The upper right plot shows the optimal partitioning of the domain for
the 4 MPI processes. In order to keep the rows owned by a single MPI process
adjacent to each other (so that each MPI process owns a contiguous set of
rows, as required by AMGCL), the grid nodes have to be renumbered. The labels
in the top left corner of each grid node show the original numbering, and the
lower-right labels show the new numbering. The renumbering of the matrix may
be represented with the help of the permutation matrix :math:`P`, where
:math:`P_{ij} = 1` if the :math:`j`-th unknown in the original ordering is
mapped to the :math:`i`-th unknown in the new ordering. The reordered system
may be written as

.. math:: P^T A P y = P^T f

The reordered matrix :math:`P^T A P` and the corresponding partitioning are
shown in the lower right plot. Note that the off-diagonal blocks on each MPI
process contain up to two times fewer non-zeros than with the naive
partitioning of the matrix. The solution :math:`x` in the original ordering
may be recovered as :math:`x = P y`.

In lines 37--55 we read the system matrix and the RHS vector using the naive
ordering (a better ordering of the unknowns will be determined later):

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 37-55
:lineno-start: 37

First, we read the total (global) number of rows in the matrix from the binary
file using the :cpp:func:`amgcl::io::crs_size()` function. Next, we divide the
global rows between the MPI processes, and read our portion of the matrix and
the RHS using the :cpp:func:`amgcl::io::read_crs()` and
:cpp:func:`amgcl::io::read_dense()` functions. The ``row_beg`` and ``row_end``
parameters of these functions specify the region (as a range of row numbers)
to read. The column indices are kept in global numbering.
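
The division of the global rows into contiguous, approximately equal chunks
may be sketched as follows (the variable names mirror the listing above; the
exact expressions in the tutorial source may differ slightly):

.. code-block:: cpp

    // `rows` is the global number of matrix rows (obtained with amgcl::io::crs_size()).
    // Each process reads the contiguous strip [row_beg, row_end) of the matrix and RHS.
    // std::min requires <algorithm>.
    ptrdiff_t chunk   = (rows + world.size - 1) / world.size; // rows per process, rounded up
    ptrdiff_t row_beg = std::min(rows, chunk * world.rank);   // first row owned by this process
    ptrdiff_t row_end = std::min(rows, row_beg + chunk);      // one past the last owned row
    chunk = row_end - row_beg;                                // actual number of local rows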

In lines 63--73 we define the backend and the solver types:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 63-73
:lineno-start: 63

The structure of the solver is the same as in the shared memory case in the
:doc:`poisson3Db` tutorial, but we are using the components from the
``amgcl::mpi`` namespace. Again, we are using the mixed-precision approach and
the preconditioner backend is defined with a single-precision value type.
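
The type definitions have approximately the following shape (a sketch modeled
on the shared-memory version of the tutorial; the exact template parameters
are given in the listing above):

.. code-block:: cpp

    // Mixed precision: double-precision backend for the iterative solver,
    // single-precision backend for the AMG preconditioner.
    typedef amgcl::backend::builtin<double> DBackend;
    typedef amgcl::backend::builtin<float>  FBackend;

    // Distributed (MPI) versions of the solver components.
    typedef amgcl::mpi::make_solver<
        amgcl::mpi::amg<
            FBackend,
            amgcl::mpi::coarsening::smoothed_aggregation<FBackend>,
            amgcl::mpi::relaxation::spai0<FBackend>
            >,
        amgcl::mpi::solver::bicgstab<DBackend>
        > Solver;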

In lines 75--77 we create the distributed matrix from the local strips read by
each of the MPI processes:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 75-77
:lineno-start: 75

We could use the tuple of the CRS arrays ``std::tie(chunk, ptr, col, val)``
directly to construct the solver (the distributed matrix would then be created
behind the scenes for us), but here we need to create the matrix explicitly
for a couple of reasons. First, since we are using the mixed-precision
approach, we need a double-precision distributed matrix for the solution step.
Second, the matrix will be used to repartition the system with either the
ParMETIS_ or the PT-SCOTCH_ library in lines 79--111:

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 79-111
:lineno-start: 79

In lines 82--87 we determine whether ParMETIS_ or PT-SCOTCH_ is available and
use the corresponding wrapper provided by AMGCL. The partitioning is then used
to compute the permutation matrix :math:`P`, which is applied to reorder both
the system matrix and the RHS vector. Since the reordering may change the
number of rows owned by each MPI process, we update the number of local rows
stored in the ``chunk`` variable.
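
Schematically, the repartitioning step has roughly the following shape (a
simplified sketch: the wrapper class names exist in AMGCL, but the helper
calls and the handling of the RHS vector here are assumptions and should be
checked against the full listing above):

.. code-block:: cpp

    // Pick whichever partitioning wrapper is available:
    typedef amgcl::mpi::partition::parmetis<DBackend> Partition; // or ...::ptscotch

    if (world.size > 1) {
        Partition part;

        // P is the distributed permutation matrix produced by the partitioner,
        // R = P^T is its transpose.
        auto P = part(*A);
        auto R = transpose(*P);

        // Reorder the system matrix: A <- P^T A P.
        A = product(*R, *product(*A, *P));

        // Reorder the RHS vector: f <- P^T f.
        std::vector<double> new_rhs(R->loc_rows());
        R->move_to_backend();
        amgcl::backend::spmv(1, *R, rhs, 0, new_rhs);
        rhs.swap(new_rhs);

        // The number of local rows may have changed after the repartitioning.
        chunk = A->loc_rows();
    }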

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi.cpp
:language: cpp
:linenos:
:lines: 113-129
:lineno-start: 113

At this point we are ready to initialize the solver (line 115) and solve the
system (line 128). Here is the output of the compiled program. Note that the
environment variable ``OMP_NUM_THREADS`` is set to 1 so as not to oversubscribe
the available CPU cores::

$ export OMP_NUM_THREADS=1
$ mpirun -np 4 ./poisson3Db_mpi poisson3Db.bin poisson3Db_b.bin
World size: 4
Matrix poisson3Db.bin: 85623x85623
RHS poisson3Db_b.bin: 85623x1
Partitioning[ParMETIS] 4 -> 4
Type: BiCGStab
Unknowns: 21671
Memory footprint: 1.16 M

Number of levels: 3
Operator complexity: 1.20
Grid complexity: 1.08

level unknowns nonzeros
---------------------------------
0 85623 2374949 (83.06%) [4]
1 6377 450473 (15.75%) [4]
2 401 34039 ( 1.19%) [4]

Iters: 24
Error: 6.09835e-09

[poisson3Db MPI: 1.273 s] (100.00%)
[ self: 0.044 s] ( 3.49%)
[ partition: 0.626 s] ( 49.14%)
[ read: 0.012 s] ( 0.93%)
[ setup: 0.152 s] ( 11.92%)
[ solve: 0.439 s] ( 34.52%)

Similarly to what was done in the :doc:`poisson3Db` section, we can use a GPU
backend in order to speed up the solution step. Since the CUDA backend does
not support the mixed-precision approach, we will use the VexCL_ backend,
which makes it possible to employ CUDA, OpenCL, or OpenMP compute devices. The
source code (`tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp`_) is very
similar to the version using the builtin backend and is shown below with the
differences highlighted.

.. _VexCL: https://github.com/ddemidov/vexcl
.. _tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp

.. literalinclude:: ../tutorial/1.poisson3Db/poisson3Db_mpi_vexcl.cpp
:language: cpp
:linenos:
:emphasize-lines: 4,36-42,68,77-78,114,127-129,142-143

Basically, we replace the ``builtin`` backend with the ``vexcl`` one,
initialize the VexCL context, and reference the context in the backend
parameters. The RHS and the solution vectors need to be transferred
to/allocated on the GPUs.
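
The backend-related changes may be sketched as follows (an approximation of
the highlighted lines; the exclusive device filter is an assumption that
mirrors the usual VexCL practice of giving each MPI process its own device):

.. code-block:: cpp

    // Use the VexCL backend instead of the builtin one:
    typedef amgcl::backend::vexcl<double> DBackend;
    typedef amgcl::backend::vexcl<float>  FBackend;

    // Initialize the VexCL context; each MPI process grabs a single compute device.
    vex::Context ctx(vex::Filter::Exclusive(vex::Filter::Env && vex::Filter::Count(1)));

    // Reference the context in the backend parameters:
    DBackend::params bprm;
    bprm.q = ctx;

    // The RHS and the solution vectors live on the compute device:
    DBackend::vector f(ctx, rhs);   // copy the local RHS strip to the device
    DBackend::vector x(ctx, chunk); // solution vector, allocated on the device
    x = 0.0;                        // start from a zero initial guess
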
Below is the output of the VexCL version using the OpenCL technology. Note that
the system the tests were performed on has only two GPUs, so the test used just
two MPI processes. The environment variable ``OMP_NUM_THREADS`` was set to 2 in
order to fully utilize all available CPU cores::

$ export OMP_NUM_THREADS=2
$ mpirun -np 2 ./poisson3Db_mpi_vexcl_cl poisson3Db.bin poisson3Db_b.bin
0: GeForce GTX 960 (NVIDIA CUDA)
1: GeForce GTX 1050 Ti (NVIDIA CUDA)
World size: 2
Matrix poisson3Db.bin: 85623x85623
RHS poisson3Db_b.bin: 85623x1
Partitioning[ParMETIS] 2 -> 2
Type: BiCGStab
Unknowns: 43255
Memory footprint: 2.31 M

Number of levels: 3
Operator complexity: 1.20
Grid complexity: 1.08

level unknowns nonzeros
---------------------------------
0 85623 2374949 (83.03%) [2]
1 6381 451279 (15.78%) [2]
2 396 34054 ( 1.19%) [2]

Iters: 24
Error: 9.14603e-09

[poisson3Db MPI(VexCL): 1.132 s] (100.00%)
[ self: 0.040 s] ( 3.56%)
[ partition: 0.607 s] ( 53.58%)
[ read: 0.015 s] ( 1.31%)
[ setup: 0.287 s] ( 25.31%)
[ solve: 0.184 s] ( 16.24%)

1 change: 0 additions & 1 deletion docs/tutorial.rst
@@ -15,5 +15,4 @@ GPU backends were obtained with the NVIDIA GeForce GTX 1050 Ti GPU.
poisson3Db
poisson3DbMPI
Serena
SerenaMPI
CoupCons3D
13 changes: 13 additions & 0 deletions tutorial/1.poisson3Db/CMakeLists.txt
@@ -21,4 +21,17 @@ if(TARGET mpi_target)
if (TARGET Metis::metis)
target_link_libraries(poisson3Db_mpi Metis::metis)
endif()

if (VexCL_FOUND)
vexcl_add_executables(poisson3Db_mpi_vexcl poisson3Db_mpi_vexcl.cpp)
target_link_libraries(poisson3Db_mpi_vexcl INTERFACE amgcl mpi_target)

if (TARGET scotch_target)
target_link_libraries(poisson3Db_mpi_vexcl INTERFACE scotch_target)
endif()

if (TARGET Metis::metis)
target_link_libraries(poisson3Db_mpi_vexcl INTERFACE Metis::metis)
endif()
endif()
endif()
Binary file added tutorial/1.poisson3Db/Poisson3D_mpi.png
