Commit

Add RankMap and use it to get some more diagnostic capability for partitioning, and fix a bug in MemoryUsageReporter (closes idaholab#12629)
friedmud committed Jan 10, 2019
1 parent cc9c7e9 commit ecd632a
Showing 20 changed files with 418 additions and 80 deletions.
47 changes: 47 additions & 0 deletions framework/doc/content/source/auxkernels/HardwareIDAux.md
@@ -0,0 +1,47 @@
# HardwareIDAux

!syntax description /AuxKernels/HardwareIDAux

## Description

One of the main purposes of this object is to aid in the diagnosis of mesh partitioners. One metric to look at for a mesh partitioner is how well it keeps down inter-node (compute node) communication. `HardwareIDAux` lets you visualize the mapping of elements to compute nodes for your job.

This is particularly interesting in the case of the [PetscExternalPartitioner](PetscExternalPartitioner.md), which can perform "hierarchical" partitioning. Hierarchical partitioning makes it possible to partition across compute nodes first and then within each compute node, in order to better respect the physical topology of the compute cluster.

One important aspect of this is that how you launch your parallel job can matter quite a bit to partitioning. In general, partitioners do better when all of the ranks of your job are assigned contiguously to each compute node. Here are four different ways to launch a job using a 100x100 generated mesh on 16 processes and 4 nodes with two different partitioners, along with the outcome as seen with `HardwareIDAux`:

Top left (METIS):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 ../../../moose_test-opt -i hardware_id_aux.i
```

Top right (Hierarchic):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 ../../../moose_test-opt -i hardware_id_aux.i -mat_partitioning_hierarchical_nfineparts 4
```

Bottom left (METIS):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 -ppn 4 ../../../moose_test-opt -i hardware_id_aux.i
```

Bottom right (Hierarchic):

```
mpiexec -n 16 -host lemhi0002,lemhi0003,lemhi0004,lemhi0005 -ppn 4 ../../../moose_test-opt -i hardware_id_aux.i -mat_partitioning_hierarchical_nfineparts 4
```

It should be immediately apparent that the bottom-right partitioning is the best (it will reduce the amount of inter-node communication). That result was achieved by using hierarchical partitioning together with `-ppn 4`, which tells `mpiexec` to place `4` processes on each compute node, making those four processes contiguous on each node. The top two examples, which omit the `-ppn` option, end up with "striped" MPI processes (one process is placed on each node and then the assignment wraps around), producing a scattered partitioning that increases the communication cost of the job (and decreases scalability).

!media media/auxkernels/hardware_id_aux.png style=width:75%

!syntax parameters /AuxKernels/HardwareIDAux

!syntax inputs /AuxKernels/HardwareIDAux

!syntax children /AuxKernels/HardwareIDAux

!bibtex bibliography
@@ -10,11 +10,13 @@ The idea here is to compute per-processor metrics to help in determining the qua…

It currently computes: the number of local elements, nodes, DoFs, and partition sides. The partition sides are the sides of elements that lie on processor boundaries (also known as "edge-cuts" in partitioner lingo). It also computes the "surface area" of each partition (physically, how much processor boundary each partition has).

## Important Notes
### HardwareID

Note that, by default, this VPP only computes the complete vector on processor 0. The vectors this VPP computes may be very large and there is no need to have a copy of them on every processor.
`WorkBalance` will now also compute the number of sides and the surface area for the partition on each compute node (called "hardware_id" here) in the cluster. This gives the amount of "inter-node" communication. Use of a hierarchical partitioner (like the one available in [PetscExternalPartitioner](PetscExternalPartitioner.md)) can help reduce inter-node communication.

However, you can modify this behavior by setting `sync_to_all_procs = true`.
For instance, here is a 1600x1600 mesh partitioned to run on 64 nodes, each having 36 processors (2304 processors total). Using `WorkBalance` and [VectorPostprocessorVisualizationAux](VectorPostprocessorVisualizationAux.md) we can visually see how much inter-node communication there is and quantify it.

!media media/vectorpostprocessors/work_balance_hardware_id.png style=width:75% caption=Visualization of inter-node communication. Left: Parmetis, Right: Hierarchical. Parmetis `hardware_id_surface_area`: 66. Hierarchical `hardware_id_surface_area`: 39.

!syntax parameters /VectorPostprocessors/WorkBalance

36 changes: 36 additions & 0 deletions framework/include/auxkernels/HardwareIDAux.h
@@ -0,0 +1,36 @@
//* This file is part of the MOOSE framework
//* https://www.mooseframework.org
//*
//* All rights reserved, see COPYRIGHT for full restrictions
//* https://github.com/idaholab/moose/blob/master/COPYRIGHT
//*
//* Licensed under LGPL 2.1, please see LICENSE for details
//* https://www.gnu.org/licenses/lgpl-2.1.html

#ifndef HARDWAREIDAUX_H
#define HARDWAREIDAUX_H

#include "AuxKernel.h"

// Forward Declarations
class HardwareIDAux;
class RankMap;

template <>
InputParameters validParams<HardwareIDAux>();

/**
* "Paints" the ID of of the physical "node" in the cluster the element
* is located on. Useful for examining partition schemes.
*/
class HardwareIDAux : public AuxKernel
{
public:
HardwareIDAux(const InputParameters & parameters);

protected:
virtual Real computeValue() override;

const RankMap & _rank_map;
};

#endif // HARDWAREIDAUX_H
10 changes: 10 additions & 0 deletions framework/include/base/MooseApp.h
@@ -21,6 +21,7 @@
#include "ConsoleStreamInterface.h"
#include "PerfGraph.h"
#include "TheWarehouse.h"
#include "RankMap.h"

#include "libmesh/parallel_object.h"
#include "libmesh/mesh_base.h"
@@ -91,6 +92,12 @@ class MooseApp : public ConsoleStreamInterface, public libMesh::ParallelObject
*/
const std::string & type() const { return _type; }

/**
* The RankMap is a useful object for determining how the processes
* are laid out on the physical nodes of the cluster
*/
const RankMap & rankMap() { return _rank_map; }

/**
* Get the PerfGraph for this app
*/
@@ -704,6 +711,9 @@ class MooseApp : public ConsoleStreamInterface, public libMesh::ParallelObject
/// The PerfGraph object for this application
PerfGraph _perf_graph;

/// The RankMap is a useful object for determining how the processes are laid out on the physical nodes of the cluster
const RankMap _rank_map;

/// Input file name used
std::string _input_filename;

2 changes: 1 addition & 1 deletion framework/include/postprocessors/MemoryUsageReporter.h
@@ -33,7 +33,7 @@ class MemoryUsageReporter
processor_id_type _nrank;

/// hardware IDs for each MPI rank (valid on rank zero only)
std::vector<unsigned int> _hardware_id;
const std::vector<unsigned int> & _hardware_id;

/// total RAM installed in the local node
unsigned long long _memory_total;
65 changes: 65 additions & 0 deletions framework/include/utils/RankMap.h
@@ -0,0 +1,65 @@
//* This file is part of the MOOSE framework
//* https://www.mooseframework.org
//*
//* All rights reserved, see COPYRIGHT for full restrictions
//* https://github.com/idaholab/moose/blob/master/COPYRIGHT
//*
//* Licensed under LGPL 2.1, please see LICENSE for details
//* https://www.gnu.org/licenses/lgpl-2.1.html

#ifndef RANKMAP_H
#define RANKMAP_H

#include "PerfGraphInterface.h"

#include "libmesh/parallel_object.h"

/**
 * Builds lists and maps that help in knowing which physical hardware node each rank is on.
*
* Note: large chunks of this code were originally committed by @dschwen in PR #12351
*
* https://github.com/idaholab/moose/pull/12351
*/
class RankMap : ParallelObject, PerfGraphInterface
{
public:
/**
* Constructs and fills the map
*/
RankMap(const Parallel::Communicator & comm, PerfGraph & perf_graph);

/**
* Returns the "hardware ID" (a unique ID given to each physical compute node in the job)
* for a given processor ID (rank)
*/
unsigned int hardwareID(processor_id_type pid) const { return _rank_to_hardware_id[pid]; }

/**
 * Returns the ranks that are on the given hardware ID (physical node in the job)
*/
const std::vector<processor_id_type> & ranks(unsigned int hardware_id) const
{
auto item = _hardware_id_to_ranks.find(hardware_id);
if (item == _hardware_id_to_ranks.end())
mooseError("hardware_id not found");

return item->second;
}

/**
* Vector containing the hardware ID for each PID
*/
const std::vector<unsigned int> & rankHardwareIds() const { return _rank_to_hardware_id; }

protected:
PerfID _construct_timer;

/// Map of hardware_id -> ranks on that node
std::map<unsigned int, std::vector<processor_id_type>> _hardware_id_to_ranks;

/// Each entry corresponds to the hardware_id for that PID
std::vector<unsigned int> _rank_to_hardware_id;
};

#endif // RANKMAP_H
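
For orientation, the accessors above (`hardwareID()`, `ranks()`, and `rankHardwareIds()`) can be reached from any code that has access to a `MooseApp`, via the `rankMap()` getter added to `MooseApp` in this commit. The following is only a minimal usage sketch; `printNodePlacement` is a hypothetical helper, not part of the commit:

```
#include <iostream>

#include "MooseApp.h"
#include "RankMap.h"

// Hypothetical helper (not in this commit): report, from the current rank,
// which physical compute node (hardware ID) it is on and how many ranks
// share that node, using the new MooseApp::rankMap() accessor.
void
printNodePlacement(MooseApp & app)
{
  const RankMap & rank_map = app.rankMap();

  const auto my_rank = app.processor_id();
  const auto my_node = rank_map.hardwareID(my_rank);

  // All ranks that live on the same physical compute node as this rank.
  const auto & node_ranks = rank_map.ranks(my_node);

  std::cout << "Rank " << my_rank << " is on hardware node " << my_node
            << " together with " << node_ranks.size() - 1 << " other ranks" << std::endl;
}
```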
11 changes: 11 additions & 0 deletions framework/include/vectorpostprocessors/WorkBalance.h
@@ -43,6 +43,11 @@ class WorkBalance : public GeneralVectorPostprocessor
/// The system to count DoFs from
int _system;

/// Helpful in determining the physical layout of the ranks
const RankMap & _rank_map;

/// The hardware ID (as given by the RankMap) of the compute node this rank is on
unsigned int _my_hardware_id;

bool _sync_to_all_procs;

dof_id_type _local_num_elems;
@@ -51,12 +56,18 @@
dof_id_type _local_num_partition_sides;
Real _local_partition_surface_area;

// These measure the size of inter-node (compute node) communication
dof_id_type _local_num_partition_hardware_id_sides;
Real _local_partition_hardware_id_surface_area;

VectorPostprocessorValue & _pid;
VectorPostprocessorValue & _num_elems;
VectorPostprocessorValue & _num_nodes;
VectorPostprocessorValue & _num_dofs;
VectorPostprocessorValue & _num_partition_sides;
VectorPostprocessorValue & _partition_surface_area;
VectorPostprocessorValue & _num_partition_hardware_id_sides;
VectorPostprocessorValue & _partition_hardware_id_surface_area;
};

#endif // WORKBALANCE_H
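
The two new `hardware_id` members above are the inter-node analogues of `_local_num_partition_sides` and `_local_partition_surface_area`. Conceptually, they can be accumulated by walking each local element's sides and asking the `RankMap` whether the neighboring element is owned by a rank on a different compute node. The actual implementation lives in `WorkBalance.C` (not shown in this excerpt); the function below is only a conceptual sketch of that idea, with the name `countHardwareIdSides` and its arguments being placeholders:

```
#include "RankMap.h"

#include "libmesh/elem.h"
#include "libmesh/mesh_base.h"

// Conceptual sketch only -- not the code from this commit. Counts element
// sides (and their total area) that sit on a boundary between two different
// physical compute nodes, as identified by the RankMap.
void
countHardwareIdSides(const libMesh::MeshBase & mesh,
                     const RankMap & rank_map,
                     libMesh::dof_id_type & num_hardware_id_sides,
                     libMesh::Real & hardware_id_surface_area)
{
  num_hardware_id_sides = 0;
  hardware_id_surface_area = 0;

  for (const auto & elem : mesh.active_local_element_ptr_range())
  {
    const auto my_hardware_id = rank_map.hardwareID(elem->processor_id());

    for (unsigned int side = 0; side < elem->n_sides(); ++side)
    {
      const libMesh::Elem * neighbor = elem->neighbor_ptr(side);

      // Sides on the physical boundary of the mesh (no neighbor) don't count.
      if (!neighbor)
        continue;

      // A neighbor owned by a rank on a different compute node means this
      // side crosses an inter-node boundary.
      if (rank_map.hardwareID(neighbor->processor_id()) != my_hardware_id)
      {
        num_hardware_id_sides++;
        hardware_id_surface_area += elem->side_ptr(side)->volume();
      }
    }
  }
}
```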
36 changes: 36 additions & 0 deletions framework/src/auxkernels/HardwareIDAux.C
@@ -0,0 +1,36 @@
//* This file is part of the MOOSE framework
//* https://www.mooseframework.org
//*
//* All rights reserved, see COPYRIGHT for full restrictions
//* https://github.com/idaholab/moose/blob/master/COPYRIGHT
//*
//* Licensed under LGPL 2.1, please see LICENSE for details
//* https://www.gnu.org/licenses/lgpl-2.1.html

#include "HardwareIDAux.h"

registerMooseObject("MooseApp", HardwareIDAux);

template <>
InputParameters
validParams<HardwareIDAux>()
{
InputParameters params = validParams<AuxKernel>();
params.addClassDescription(
"Creates a field showing the assignment of partitions to physical nodes in the cluster.");
return params;
}

HardwareIDAux::HardwareIDAux(const InputParameters & parameters)
: AuxKernel(parameters), _rank_map(_app.rankMap())
{
}

Real
HardwareIDAux::computeValue()
{
if (isNodal())
return _rank_map.hardwareID(_current_node->processor_id());
else
return _rank_map.hardwareID(_current_elem->processor_id());
}
1 change: 1 addition & 0 deletions framework/src/base/MooseApp.C
@@ -260,6 +260,7 @@ MooseApp::MooseApp(InputParameters parameters)
_pars(parameters),
_type(getParam<std::string>("_type")),
_comm(getParam<std::shared_ptr<Parallel::Communicator>>("_comm")),
_rank_map(*_comm, _perf_graph),
_output_position_set(false),
_start_time_set(false),
_start_time(0.0),
66 changes: 3 additions & 63 deletions framework/src/postprocessors/MemoryUsageReporter.C
@@ -9,12 +9,13 @@

#include "MemoryUsageReporter.h"
#include "MemoryUtils.h"
#include "MooseApp.h"

MemoryUsageReporter::MemoryUsageReporter(const MooseObject * moose_object)
: _mur_communicator(moose_object->comm()),
_my_rank(_mur_communicator.rank()),
_nrank(_mur_communicator.size()),
_hardware_id(_nrank)
_hardware_id(moose_object->getMooseApp().rankMap().rankHardwareIds())
{
// get total available ram
_memory_total = MemoryUtils::getTotalRAM();
@@ -25,10 +26,9 @@ MemoryUsageReporter::MemoryUsageReporter(const MooseObject * moose_object)
std::vector<unsigned long long> memory_totals(_nrank);
_mur_communicator.gather(0, _memory_total, memory_totals);

sharedMemoryRanksBySplitCommunicator();

// validate and store per node memory
if (_my_rank == 0)
{
for (std::size_t i = 0; i < _nrank; ++i)
{
auto id = _hardware_id[i];
@@ -41,65 +41,5 @@ MemoryUsageReporter::MemoryUsageReporter(const MooseObject * moose_object)
mooseWarning("Inconsistent total memory reported by ranks on the same hardware node in ",
moose_object->name());
}
}

void
MemoryUsageReporter::sharedMemoryRanksBySplitCommunicator()
{
// figure out which ranks share memory
processor_id_type world_rank = 0;
#ifdef LIBMESH_HAVE_MPI
// create a split communicator among shared memory ranks
MPI_Comm shmem_raw_comm;
MPI_Comm_split_type(
_mur_communicator.get(), MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmem_raw_comm);
Parallel::Communicator shmem_comm(shmem_raw_comm);

// broadcast the world rank of the sub group root
world_rank = _my_rank;
shmem_comm.broadcast(world_rank, 0);
#endif
std::vector<processor_id_type> world_ranks(_nrank);
_mur_communicator.gather(0, world_rank, world_ranks);

// assign a contiguous unique numerical id to each shared memory group on processor zero
unsigned int id = 0;
processor_id_type last = world_ranks[0];
if (_my_rank == 0)
for (std::size_t i = 0; i < _nrank; ++i)
{
if (world_ranks[i] != last)
{
last = world_ranks[i];
id++;
}
_hardware_id[i] = id;
}
}

void
MemoryUsageReporter::sharedMemoryRanksByProcessorname()
{
// get processor names and assign a unique number to each piece of hardware
std::string processor_name = MemoryUtils::getMPIProcessorName();

// gather all names at processor zero
std::vector<std::string> processor_names(_nrank);
_mur_communicator.gather(0, processor_name, processor_names);

// assign a unique numerical id to them on processor zero
unsigned int id = 0;
if (_my_rank == 0)
{
// map to assign an id to each processor name string
std::map<std::string, unsigned int> hardware_id_map;
for (std::size_t i = 0; i < _nrank; ++i)
{
// generate or look up unique ID for the current processor name
auto it = hardware_id_map.lower_bound(processor_names[i]);
if (it == hardware_id_map.end() || it->first != processor_names[i])
it = hardware_id_map.emplace_hint(it, processor_names[i], id++);
_hardware_id[i] = it->second;
}
}
}
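
The helpers deleted above are where the hardware IDs used to be computed; that logic now lives in the new `RankMap` (consumed via `rankMap().rankHardwareIds()` in the constructor change above). For reference, the split-communicator approach used by `sharedMemoryRanksBySplitCommunicator` can be demonstrated standalone. The program below is only a sketch that mirrors the deleted id-assignment loop (and, like it, assumes ranks from the same node appear contiguously in world order); it is not code from this commit:

```
#include <mpi.h>

#include <cstdio>
#include <vector>

// Standalone sketch of the split-communicator technique: ranks that can share
// memory (i.e. live on the same physical node) end up in the same
// sub-communicator, and the world rank of each sub-communicator's root serves
// as a node identifier.
int main(int argc, char ** argv)
{
  MPI_Init(&argc, &argv);

  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Split into one communicator per shared-memory (compute) node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

  // Broadcast the world rank of the node root to everyone on that node.
  int node_root = world_rank;
  MPI_Bcast(&node_root, 1, MPI_INT, 0, node_comm);

  // Gather the node roots on world rank 0.
  std::vector<int> node_roots(world_size);
  MPI_Gather(&node_root, 1, MPI_INT, node_roots.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

  // Assign contiguous hardware IDs on rank 0, incrementing whenever the node
  // root changes (the same scheme as the deleted loop above).
  if (world_rank == 0)
  {
    std::vector<int> hardware_id(world_size);
    int id = 0, last = node_roots[0];
    for (int i = 0; i < world_size; ++i)
    {
      if (node_roots[i] != last)
      {
        last = node_roots[i];
        ++id;
      }
      hardware_id[i] = id;
      std::printf("rank %d -> hardware_id %d\n", i, hardware_id[i]);
    }
  }

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```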
