Commit

Add RankMap and use it to get some more diagnostic capability for partitioning, and fix a bug in MemoryUsageReporter (closes idaholab#12629)
friedmud committed Jan 10, 2019
1 parent cc9c7e9 commit ecd632a
Showing 20 changed files with 418 additions and 80 deletions.
47 changes: 47 additions & 0 deletions framework/doc/content/source/auxkernels/HardwareIDAux.md
@@ -0,0 +1,47 @@
# HardwareIDAux

!syntax description /AuxKernels/HardwareIDAux

## Description

One of the main purposes of this object is to aid in the diagnosis of mesh partitioners. One metric to look at for a mesh partitioner is how well it keeps down inter-node (compute node) communication. `HardwareIDAux` lets you visualize the mapping of elements to compute nodes for your job.

This is particularly interesting in the case of the [PetscExternalPartitioner](PetscExternalPartitioner.md), which can perform "hierarchical" partitioning. Hierarchical partitioning makes it possible to partition across compute nodes first and then within each compute node, in order to better respect the physical topology of the compute cluster.

One important aspect of this is that how you launch your parallel job can matter quite a bit to partitioning. In general, partitioners do better when all of the ranks of your job are assigned contiguously to each compute node. Here are four different ways to launch a job using a 100x100 generated mesh on 16 processes and 4 nodes with two different partitioners, along with the outcome as seen with `HardwareIDAux`:

Top left (METIS):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 ../../../moose_test-opt -i hardware_id_aux.i
```

Top right (Hierarchic):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 ../../../moose_test-opt -i hardware_id_aux.i -mat_partitioning_hierarchical_nfineparts 4
```

Bottom left (METIS):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 -ppn 4 ../../../moose_test-opt -i hardware_id_aux.i
```

Bottom right (Hierarchic):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 -ppn 4 ../../../moose_test-opt -i hardware_id_aux.i -mat_partitioning_hierarchical_nfineparts 4
```

It should be immediately apparent that the bottom-right partitioning is the best (it will reduce the amount of inter-node communication). That result was achieved by using hierarchical partitioning together with `-ppn 4`, which tells `mpiexec` to place `4` processes on each compute node, making those four processes contiguous on each node. The top two examples, which omit the `-ppn` option, end up with "striped" MPI processes (one process is placed on each node and then the assignment wraps around), producing a scattered partitioning that increases the communication cost of the job (and decreases scalability).

!media media/auxkernels/hardware_id_aux.png style=width:75%

!syntax parameters /AuxKernels/HardwareIDAux

!syntax inputs /AuxKernels/HardwareIDAux

!syntax children /AuxKernels/HardwareIDAux

!bibtex bibliography
@@ -10,11 +10,13 @@ The idea here is to compute per-processor metrics to help in determining the qua…

It currently computes: the number of local elements, nodes, DoFs, and partition sides. The partition sides are the sides of elements that lie on processor boundaries (also known as "edge-cuts" in partitioner lingo). It also computes the "surface area" of each partition (physically, how much processor boundary each partition has).

## Important Notes
### HardwareID

Note that, by default, this VPP only computes the complete vector on processor 0. The vectors this VPP computes may be very large and there is no need to have a copy of them on every processor.
`WorkBalance` will now also compute the number of sides and the surface area for the partition on each compute node (called "hardware_id" here) in the cluster. This gives the amount of "inter-node" communication. Use of a hierarchical partitioner (like the one available in [PetscExternalPartitioner](PetscExternalPartitioner.md)) can help reduce inter-node communication.

However, you can modify this behavior by setting `sync_to_all_procs = true`.
For instance, here is a 1600x1600 mesh partitioned to run on 64 nodes, each having 36 processors (2304 processors total). Using `WorkBalance` and [VectorPostprocessorVisualizationAux](VectorPostprocessorVisualizationAux.md) we can visually see how much inter-node communication there is and quantify it.

!media media/vectorpostprocessors/work_balance_hardware_id.png style=width:75% caption=Visualization of inter-node communication. Left: Parmetis, Right: Hierarchical. Parmetis `hardware_id_surface_area`: 66. Hierarchical `hardware_id_surface_area`: 39.

!syntax parameters /VectorPostprocessors/WorkBalance

36 changes: 36 additions & 0 deletions framework/include/auxkernels/HardwareIDAux.h
@@ -0,0 +1,36 @@
//* This file is part of the MOOSE framework
//* https://www.mooseframework.org
//*
//* All rights reserved, see COPYRIGHT for full restrictions
//* https://github.com/idaholab/moose/blob/master/COPYRIGHT
//*
//* Licensed under LGPL 2.1, please see LICENSE for details
//* https://www.gnu.org/licenses/lgpl-2.1.html

#ifndef HARDWAREIDAUX_H
#define HARDWAREIDAUX_H

#include "AuxKernel.h"

// Forward Declarations
class HardwareIDAux;
class RankMap;

template <>
InputParameters validParams<HardwareIDAux>();

/**
* "Paints" the ID of of the physical "node" in the cluster the element
* is located on. Useful for examining partition schemes.
*/
class HardwareIDAux : public AuxKernel
{
public:
HardwareIDAux(const InputParameters & parameters);

protected:
virtual Real computeValue() override;

const RankMap & _rank_map;
};

#endif // HARDWAREIDAUX_H
10 changes: 10 additions & 0 deletions framework/include/base/MooseApp.h
@@ -21,6 +21,7 @@
#include "ConsoleStreamInterface.h"
#include "PerfGraph.h"
#include "TheWarehouse.h"
#include "RankMap.h"

#include "libmesh/parallel_object.h"
#include "libmesh/mesh_base.h"
@@ -91,6 +92,12 @@ class MooseApp : public ConsoleStreamInterface, public libMesh::ParallelObject
*/
const std::string & type() const { return _type; }

/**
* The RankMap is a useful object for determining how the processes
* are laid out on the physical nodes of the cluster
*/
const RankMap & rankMap() { return _rank_map; }

/**
* Get the PerfGraph for this app
*/
@@ -704,6 +711,9 @@ class MooseApp : public ConsoleStreamInterface, public libMesh::ParallelObject
/// The PerfGraph object for this application
PerfGraph _perf_graph;

/// The RankMap is a useful object for determining how the processes are laid out on the physical nodes of the cluster
const RankMap _rank_map;

/// Input file name used
std::string _input_filename;

2 changes: 1 addition & 1 deletion framework/include/postprocessors/MemoryUsageReporter.h
@@ -33,7 +33,7 @@ class MemoryUsageReporter
processor_id_type _nrank;

/// hardware IDs for each MPI rank (valid on rank zero only)
std::vector<unsigned int> _hardware_id;
const std::vector<unsigned int> & _hardware_id;

/// total RAM installed in the local node
unsigned long long _memory_total;
65 changes: 65 additions & 0 deletions framework/include/utils/RankMap.h
@@ -0,0 +1,65 @@
//* This file is part of the MOOSE framework
//* https://www.mooseframework.org
//*
//* All rights reserved, see COPYRIGHT for full restrictions
//* https://github.com/idaholab/moose/blob/master/COPYRIGHT
//*
//* Licensed under LGPL 2.1, please see LICENSE for details
//* https://www.gnu.org/licenses/lgpl-2.1.html

#ifndef RANKMAP_H
#define RANKMAP_H

#include "PerfGraphInterface.h"

#include "libmesh/parallel_object.h"

/**
 * Builds lists and maps that help in knowing which physical hardware node each rank is on.
*
* Note: large chunks of this code were originally committed by @dschwen in PR #12351
*
* https://github.com/idaholab/moose/pull/12351
*/
class RankMap : ParallelObject, PerfGraphInterface
{
public:
/**
* Constructs and fills the map
*/
RankMap(const Parallel::Communicator & comm, PerfGraph & perf_graph);

/**
* Returns the "hardware ID" (a unique ID given to each physical compute node in the job)
* for a given processor ID (rank)
*/
unsigned int hardwareID(processor_id_type pid) const { return _rank_to_hardware_id[pid]; }

/**
 * Returns the ranks that are on the given hardware ID (physical node in the job)
*/
const std::vector<processor_id_type> & ranks(unsigned int hardware_id) const
{
auto item = _hardware_id_to_ranks.find(hardware_id);
if (item == _hardware_id_to_ranks.end())
mooseError("hardware_id not found");

return item->second;
}

/**
* Vector containing the hardware ID for each PID
*/
const std::vector<unsigned int> & rankHardwareIds() const { return _rank_to_hardware_id; }

protected:
PerfID _construct_timer;

/// Map of hardware_id -> ranks on that node
std::map<unsigned int, std::vector<processor_id_type>> _hardware_id_to_ranks;

/// Each entry corresponds to the hardware_id for that PID
std::vector<unsigned int> _rank_to_hardware_id;
};

#endif // RANKMAP_H
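
For orientation, the accessors above (`hardwareID()`, `ranks()`, and `rankHardwareIds()`) can be reached from any code that has access to a `MooseApp`, via the `rankMap()` getter added to `MooseApp` in this commit. The following is only a minimal usage sketch; `printNodePlacement` is a hypothetical helper, not part of the commit:

```
#include <iostream>

#include "MooseApp.h"
#include "RankMap.h"

// Hypothetical helper (not in this commit): report, from the current rank,
// which physical compute node (hardware ID) it is on and how many ranks
// share that node, using the new MooseApp::rankMap() accessor.
void
printNodePlacement(MooseApp & app)
{
  const RankMap & rank_map = app.rankMap();

  const auto my_rank = app.processor_id();
  const auto my_node = rank_map.hardwareID(my_rank);

  // All ranks that live on the same physical compute node as this rank.
  const auto & node_ranks = rank_map.ranks(my_node);

  std::cout << "Rank " << my_rank << " is on hardware node " << my_node
            << " together with " << node_ranks.size() - 1 << " other ranks" << std::endl;
}
```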
11 changes: 11 additions & 0 deletions framework/include/vectorpostprocessors/WorkBalance.h
@@ -43,6 +43,11 @@ class WorkBalance : public GeneralVectorPostprocessor
/// The system to count DoFs from
int _system;

/// Helpful in determining the physical layout of the ranks
const RankMap & _rank_map;

/// The hardware ID (as given by the RankMap) of the compute node this rank is on
unsigned int _my_hardware_id;

bool _sync_to_all_procs;

dof_id_type _local_num_elems;
@@ -51,12 +56,18 @@
dof_id_type _local_num_partition_sides;
Real _local_partition_surface_area;

// These measure the size of inter-node (compute node) communication
dof_id_type _local_num_partition_hardware_id_sides;
Real _local_partition_hardware_id_surface_area;

VectorPostprocessorValue & _pid;
VectorPostprocessorValue & _num_elems;
VectorPostprocessorValue & _num_nodes;
VectorPostprocessorValue & _num_dofs;
VectorPostprocessorValue & _num_partition_sides;
VectorPostprocessorValue & _partition_surface_area;
VectorPostprocessorValue & _num_partition_hardware_id_sides;
VectorPostprocessorValue & _partition_hardware_id_surface_area;
};

#endif // WORKBALANCE_H
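
The two new `hardware_id` members above are the inter-node analogues of `_local_num_partition_sides` and `_local_partition_surface_area`. Conceptually, they can be accumulated by walking each local element's sides and asking the `RankMap` whether the neighboring element is owned by a rank on a different compute node. The actual implementation lives in `WorkBalance.C` (not shown in this excerpt); the function below is only a conceptual sketch of that idea, with the name `countHardwareIdSides` and its arguments being placeholders:

```
#include "RankMap.h"

#include "libmesh/elem.h"
#include "libmesh/mesh_base.h"

// Conceptual sketch only -- not the code from this commit. Counts element
// sides (and their total area) that sit on a boundary between two different
// physical compute nodes, as identified by the RankMap.
void
countHardwareIdSides(const libMesh::MeshBase & mesh,
                     const RankMap & rank_map,
                     libMesh::dof_id_type & num_hardware_id_sides,
                     libMesh::Real & hardware_id_surface_area)
{
  num_hardware_id_sides = 0;
  hardware_id_surface_area = 0;

  for (const auto & elem : mesh.active_local_element_ptr_range())
  {
    const auto my_hardware_id = rank_map.hardwareID(elem->processor_id());

    for (unsigned int side = 0; side < elem->n_sides(); ++side)
    {
      const libMesh::Elem * neighbor = elem->neighbor_ptr(side);

      // Sides on the physical boundary of the mesh (no neighbor) don't count.
      if (!neighbor)
        continue;

      // A neighbor owned by a rank on a different compute node means this
      // side crosses an inter-node boundary.
      if (rank_map.hardwareID(neighbor->processor_id()) != my_hardware_id)
      {
        num_hardware_id_sides++;
        hardware_id_surface_area += elem->side_ptr(side)->volume();
      }
    }
  }
}
```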
36 changes: 36 additions & 0 deletions framework/src/auxkernels/HardwareIDAux.C
@@ -0,0 +1,36 @@
//* This file is part of the MOOSE framework
//* https://www.mooseframework.org
//*
//* All rights reserved, see COPYRIGHT for full restrictions
//* https://github.com/idaholab/moose/blob/master/COPYRIGHT
//*
//* Licensed under LGPL 2.1, please see LICENSE for details
//* https://www.gnu.org/licenses/lgpl-2.1.html

#include "HardwareIDAux.h"

registerMooseObject("MooseApp", HardwareIDAux);

template <>
InputParameters
validParams<HardwareIDAux>()
{
InputParameters params = validParams<AuxKernel>();
params.addClassDescription(
"Creates a field showing the assignment of partitions to physical nodes in the cluster.");
return params;
}

HardwareIDAux::HardwareIDAux(const InputParameters & parameters)
: AuxKernel(parameters), _rank_map(_app.rankMap())
{
}

Real
HardwareIDAux::computeValue()
{
if (isNodal())
return _rank_map.hardwareID(_current_node->processor_id());
else
return _rank_map.hardwareID(_current_elem->processor_id());
}
1 change: 1 addition & 0 deletions framework/src/base/MooseApp.C
@@ -260,6 +260,7 @@ MooseApp::MooseApp(InputParameters parameters)
_pars(parameters),
_type(getParam<std::string>("_type")),
_comm(getParam<std::shared_ptr<Parallel::Communicator>>("_comm")),
_rank_map(*_comm, _perf_graph),
_output_position_set(false),
_start_time_set(false),
_start_time(0.0),
66 changes: 3 additions & 63 deletions framework/src/postprocessors/MemoryUsageReporter.C
@@ -9,12 +9,13 @@

#include "MemoryUsageReporter.h"
#include "MemoryUtils.h"
#include "MooseApp.h"

MemoryUsageReporter::MemoryUsageReporter(const MooseObject * moose_object)
: _mur_communicator(moose_object->comm()),
_my_rank(_mur_communicator.rank()),
_nrank(_mur_communicator.size()),
_hardware_id(_nrank)
_hardware_id(moose_object->getMooseApp().rankMap().rankHardwareIds())
{
// get total available ram
_memory_total = MemoryUtils::getTotalRAM();
@@ -25,10 +26,9 @@ MemoryUsageReporter::MemoryUsageReporter(const MooseObject * moose_object)
std::vector<unsigned long long> memory_totals(_nrank);
_mur_communicator.gather(0, _memory_total, memory_totals);

sharedMemoryRanksBySplitCommunicator();

// validate and store per node memory
if (_my_rank == 0)
{
for (std::size_t i = 0; i < _nrank; ++i)
{
auto id = _hardware_id[i];
@@ -41,65 +41,5 @@ MemoryUsageReporter::MemoryUsageReporter(const MooseObject * moose_object)
mooseWarning("Inconsistent total memory reported by ranks on the same hardware node in ",
moose_object->name());
}
}

void
MemoryUsageReporter::sharedMemoryRanksBySplitCommunicator()
{
// figure out which ranks share memory
processor_id_type world_rank = 0;
#ifdef LIBMESH_HAVE_MPI
// create a split communicator among shared memory ranks
MPI_Comm shmem_raw_comm;
MPI_Comm_split_type(
_mur_communicator.get(), MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmem_raw_comm);
Parallel::Communicator shmem_comm(shmem_raw_comm);

// broadcast the world rank of the sub group root
world_rank = _my_rank;
shmem_comm.broadcast(world_rank, 0);
#endif
std::vector<processor_id_type> world_ranks(_nrank);
_mur_communicator.gather(0, world_rank, world_ranks);

// assign a contiguous unique numerical id to each shared memory group on processor zero
unsigned int id = 0;
processor_id_type last = world_ranks[0];
if (_my_rank == 0)
for (std::size_t i = 0; i < _nrank; ++i)
{
if (world_ranks[i] != last)
{
last = world_ranks[i];
id++;
}
_hardware_id[i] = id;
}
}

void
MemoryUsageReporter::sharedMemoryRanksByProcessorname()
{
// get processor names and assign a unique number to each piece of hardware
std::string processor_name = MemoryUtils::getMPIProcessorName();

// gather all names at processor zero
std::vector<std::string> processor_names(_nrank);
_mur_communicator.gather(0, processor_name, processor_names);

// assign a unique numerical id to them on processor zero
unsigned int id = 0;
if (_my_rank == 0)
{
// map to assign an id to each processor name string
std::map<std::string, unsigned int> hardware_id_map;
for (std::size_t i = 0; i < _nrank; ++i)
{
// generate or look up unique ID for the current processor name
auto it = hardware_id_map.lower_bound(processor_names[i]);
if (it == hardware_id_map.end() || it->first != processor_names[i])
it = hardware_id_map.emplace_hint(it, processor_names[i], id++);
_hardware_id[i] = it->second;
}
}
}
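
The helpers deleted above are where the hardware IDs used to be computed; that logic now lives in the new `RankMap` (consumed via `rankMap().rankHardwareIds()` in the constructor change above). For reference, the split-communicator approach used by `sharedMemoryRanksBySplitCommunicator` can be demonstrated standalone. The program below is only a sketch that mirrors the deleted id-assignment loop (and, like it, assumes ranks from the same node appear contiguously in world order); it is not code from this commit:

```
#include <mpi.h>

#include <cstdio>
#include <vector>

// Standalone sketch of the split-communicator technique: ranks that can share
// memory (i.e. live on the same physical node) end up in the same
// sub-communicator, and the world rank of each sub-communicator's root serves
// as a node identifier.
int main(int argc, char ** argv)
{
  MPI_Init(&argc, &argv);

  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Split into one communicator per shared-memory (compute) node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

  // Broadcast the world rank of the node root to everyone on that node.
  int node_root = world_rank;
  MPI_Bcast(&node_root, 1, MPI_INT, 0, node_comm);

  // Gather the node roots on world rank 0.
  std::vector<int> node_roots(world_size);
  MPI_Gather(&node_root, 1, MPI_INT, node_roots.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

  // Assign contiguous hardware IDs on rank 0, incrementing whenever the node
  // root changes (the same scheme as the deleted loop above).
  if (world_rank == 0)
  {
    std::vector<int> hardware_id(world_size);
    int id = 0, last = node_roots[0];
    for (int i = 0; i < world_size; ++i)
    {
      if (node_roots[i] != last)
      {
        last = node_roots[i];
        ++id;
      }
      hardware_id[i] = id;
      std::printf("rank %d -> hardware_id %d\n", i, hardware_id[i]);
    }
  }

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```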
