5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -16,13 +16,18 @@ The format of this changelog is based on

#### New Features

- Added a new table reporting memory consumption for various stages in the
simulation. [PR 708](https://github.com/awslabs/palace/pull/708)

- Expanded JSON schema validation to cover required fields, mutual exclusion constraints
(e.g., `PEC`/`Ground`, `PMC`/`ZeroCharge`), array type validation, and numeric bounds.
Many runtime checks are now caught earlier at configuration parsing time with clearer
error messages [PR 635](https://github.com/awslabs/palace/pull/635).

#### Bug Fixes

- Reduced memory usage when `MaxIts` for GMRES is larger than the number of
required iterations. [PR 715](https://github.com/awslabs/palace/pull/715)
- Improved IO performance for simulations with Adaptive Mesh Refinement. Now,
files from previous iterations are moved instead of being copied. The latest
output is always available at the top level of the output directory (as
54 changes: 54 additions & 0 deletions docs/src/developer/notes.md
@@ -217,6 +217,60 @@ Disk IO // < Disk read/write time for loading the mesh f
Total // < Total simulation time
```

## Memory Reporting

Memory reporting in *Palace* tracks **peak** RSS (Resident Set Size) at two granularities:
per-process snapshots and per-phase growth. The goal of this reporting is to understand the
maximum memory required by a *Palace* simulation. We do not track how much memory *Palace*
uses at any given time; instead, we measure how much each phase of the simulation increases
the high-water mark. This is useful for understanding and reducing the total memory required
by *Palace*. Note that this tool cannot comprehensively track the memory lifecycle, because
it does not see allocations and deallocations that do not increase the peak RSS.

### Per-process snapshots

The `memory_reporting` utilities (`memoryreporting.hpp`) provide functions for querying the
current and peak RSS of a process. These are aggregated across MPI ranks (min/max/avg) and
across nodes (by splitting the communicator with `MPI_Comm_split_type`). Two snapshots are
printed during each simulation: current memory after mesh loading, and peak memory after the
solve.
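
For reference, here is a minimal sketch of how a process can query its own peak RSS on
Linux. This illustrates the underlying mechanism only; it is not a copy of the
`memoryreporting.hpp` implementation:

```cpp
#include <sys/resource.h>

// Return the peak (high-water mark) RSS of the calling process in bytes.
// On Linux, getrusage() reports ru_maxrss in kilobytes.
long QueryPeakMemoryBytes()
{
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) != 0)
  {
    return 0;  // Query failed; report zero rather than an arbitrary value.
  }
  return usage.ru_maxrss * 1024L;
}
```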

### Per-phase memory growth

The `Timer`/`BlockTimer` system tracks not only elapsed time but also **peak** RSS growth per
phase. Every `BlockTimer` scope automatically records how much the peak RSS increased during
that phase, using the same stack-based interruption mechanism as timing: entering a nested
scope attributes the memory growth so far to the outer scope, then starts tracking the inner
scope.
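
As an illustrative sketch of this attribution (the phase indices and functions below are
hypothetical, not taken from the *Palace* sources):

```cpp
{
  BlockTimer outer(Timer::CONSTRUCT);
  AssembleOperators();  // Peak-RSS growth here accrues to CONSTRUCT.
  {
    // Entering the nested scope attributes the growth so far to CONSTRUCT,
    // then starts tracking SOLVE from the current peak.
    BlockTimer inner(Timer::SOLVE);
    RunLinearSolve();
  }  // SOLVE's growth is recorded here, and tracking for CONSTRUCT resumes.
}
```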

The per-phase memory table is printed alongside the timing table at the end of each
simulation. It has three columns:

- Per-Node: the maximum growth on any single node during this phase.
- Total: the sum of growth across all nodes during this phase.
- Total HWM (High-Water Mark): the high-water mark growth at the end of this phase, summed
  across all nodes.
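
For illustration only, the table might look like the following (the phase names and values
are invented, not actual *Palace* output):

```
Phase             Per-Node     Total   Total HWM
Initialization      120.0M    480.0M      480.0M
Solve                 1.2G      4.6G        5.1G
Postprocessing       64.0M    256.0M        5.3G
```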

The `BlockTimer` constructor accepts a `count` parameter (default `true`). When `count` is
`false`, both timing and memory tracking are disabled for that scope, which can be used to
exclude a section of code from an enclosing `BlockTimer`'s measurements.
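
For example (a sketch; the phase index and the excluded function are hypothetical):

```cpp
{
  BlockTimer bt(Timer::POSTPRO);
  WriteFields();  // Timed and memory-tracked as part of POSTPRO.
  {
    // count = false: neither the elapsed time nor the peak-RSS growth of
    // this scope is attributed to any phase.
    BlockTimer skip(Timer::POSTPRO, false);
    DumpDebugDiagnostics();
  }
}
```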

Per-phase memory data is also saved to `palace.json` under the `PeakMemoryGrowthMegabytes`
and `PeakNodeMemoryGrowthMegabytes` keys, alongside the existing `ElapsedTime` data.
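
Based on the `SaveMetadata` changes in this PR, the saved metadata has roughly the following
shape (the phase keys and values here are illustrative):

```json
{
  "ElapsedTime": {
    "Durations": { "Init": 1.2, "Solve": 240.7 },
    "Counts": { "Init": 1, "Solve": 1 }
  },
  "PeakMemoryGrowthMegabytes": {
    "Min": { "Init": 85.1, "Solve": 410.8 },
    "Max": { "Init": 90.4, "Solve": 455.2 },
    "Sum": { "Init": 700.9, "Solve": 3480.6 }
  },
  "PeakNodeMemoryGrowthMegabytes": {
    "Min": { "Init": 350.2, "Solve": 1730.4 },
    "Max": { "Init": 352.1, "Solve": 1752.3 },
    "Sum": { "Init": 702.3, "Solve": 3482.7 }
  }
}
```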

### Interpreting memory data

Peak RSS is monotonically non-decreasing (it is the OS high-water mark), so per-phase deltas
are always non-negative. A phase that allocates memory temporarily and then frees it will
still show growth if the allocation pushed the peak. A phase showing zero growth did not
exceed the previously established peak; this does not mean that no allocations were
performed.

Per-phase deltas may not sum exactly to the total because memory growth can occur between
timed phases (e.g., during scope transitions or in code not wrapped by a `BlockTimer`).

You should read each row as "this phase increases the peak memory by this amount".

## Profiling *Palace* on CPUs

A typical *Palace* simulation spends most of its time in libCEED kernels, which, in turn, execute LIBXSMM code on CPUs. LIBXSMM generates code just-in-time to ensure it is as performant as possible for the given architecture and problem. This generated code confuses most profilers. Luckily, [LIBXSMM](https://libxsmm.readthedocs.io/en/latest/libxsmm_prof/) can integrate with the VTune APIs to enable profiling of jitted functions as well.
10 changes: 10 additions & 0 deletions palace/drivers/basesolver.cpp
@@ -181,6 +181,7 @@ void BaseSolver::SolveEstimateMarkRefine(std::vector<std::unique_ptr<Mesh>> &mes
{
// Print timing summary.
Mpi::Print(comm, "\nCumulative timing statistics:\n");
BlockTimer::Finalize(comm);
BlockTimer::Print(comm);
auto peak_mem = memory_reporting::GetPeakMemoryStats(comm);
auto peak_node_mem = memory_reporting::GetPeakNodeMemoryStats(comm);
@@ -300,13 +301,22 @@ void BaseSolver::SaveMetadata(const Timer &timer) const
{
if (root)
{
constexpr double to_mb = 1.0 / (1024.0 * 1024.0);
auto red = BlockTimer::GetReductions();

json meta = LoadMetadata(post_dir);
for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
{
auto key = Timer::descriptions[i];
key.erase(std::remove_if(key.begin(), key.end(), isspace), key.end());
meta["ElapsedTime"]["Durations"][key] = timer.Data((Timer::Index)i);
meta["ElapsedTime"]["Counts"][key] = timer.Counts((Timer::Index)i);
meta["PeakMemoryGrowthMegabytes"]["Min"][key] = red.rank_mem.min[i] * to_mb;
meta["PeakMemoryGrowthMegabytes"]["Max"][key] = red.rank_mem.max[i] * to_mb;
meta["PeakMemoryGrowthMegabytes"]["Sum"][key] = red.rank_mem.sum[i] * to_mb;
meta["PeakNodeMemoryGrowthMegabytes"]["Min"][key] = red.node_mem.min[i] * to_mb;
meta["PeakNodeMemoryGrowthMegabytes"]["Max"][key] = red.node_mem.max[i] * to_mb;
meta["PeakNodeMemoryGrowthMegabytes"]["Sum"][key] = red.node_mem.sum[i] * to_mb;
}
WriteMetadata(post_dir, meta);
}
1 change: 1 addition & 0 deletions palace/main.cpp
@@ -304,6 +304,7 @@ int main(int argc, char *argv[])
Mpi::Print(world_comm, "\n");
memory_reporting::PrintMemoryUsage(world_comm, peak_mem);
memory_reporting::PrintMemoryUsage(world_comm, peak_node_mem);
BlockTimer::Finalize(world_comm);
BlockTimer::Print(world_comm);
solver->SaveMetadata(BlockTimer::GlobalTimer());
solver->SaveMetadata(peak_mem);
16 changes: 8 additions & 8 deletions palace/utils/memoryreporting.cpp
@@ -83,9 +83,6 @@ long GetPeakMemory()
return 0;
}

namespace
{

std::string FormatBytes(double bytes)
{
constexpr double kB = 1024.0;
@@ -107,6 +104,9 @@ std::string FormatBytes(double bytes)
return fmt::format("{:.1f}K", bytes / kB);
}

namespace
{

MemoryStats ComputeStats(std::string label, long local_value, MPI_Comm comm)
{
long val_min = local_value, val_max = local_value, val_sum = local_value;
@@ -116,7 +116,9 @@ MemoryStats ComputeStats(std::string label, long local_value, MPI_Comm comm)
return {label, val_min, val_max, val_sum, static_cast<double>(val_sum) / Mpi::Size(comm)};
}

MemoryStats ComputeNodeStats(std::string label, long local_value, MPI_Comm comm)
} // namespace

MemoryStats ComputeNodeMemoryStats(std::string label, long local_value, MPI_Comm comm)
{
// Split communicator into shared memory groups (processes on same node).
MPI_Comm node_comm;
@@ -165,8 +167,6 @@ MemoryStats ComputeNodeStats(std::string label, long local_value, MPI_Comm comm)
return stats;
}

} // namespace

MemoryStats GetCurrentMemoryStats(MPI_Comm comm)
{
return ComputeStats("current per-rank", GetCurrentMemory(), comm);
@@ -179,12 +179,12 @@ MemoryStats GetPeakMemoryStats(MPI_Comm comm)

MemoryStats GetCurrentNodeMemoryStats(MPI_Comm comm)
{
return ComputeNodeStats("current per-node", GetCurrentMemory(), comm);
return ComputeNodeMemoryStats("current per-node", GetCurrentMemory(), comm);
}

MemoryStats GetPeakNodeMemoryStats(MPI_Comm comm)
{
return ComputeNodeStats("peak per-node", GetPeakMemory(), comm);
return ComputeNodeMemoryStats("peak per-node", GetPeakMemory(), comm);
}

void PrintMemoryUsage(MPI_Comm comm, const MemoryStats &stats)
7 changes: 7 additions & 0 deletions palace/utils/memoryreporting.hpp
@@ -34,6 +34,13 @@ MemoryStats GetPeakMemoryStats(MPI_Comm comm);
MemoryStats GetCurrentNodeMemoryStats(MPI_Comm comm);
MemoryStats GetPeakNodeMemoryStats(MPI_Comm comm);

// Compute per-node statistics for an arbitrary per-rank value (bytes). Sums across
// processes sharing a node and reports min/max/avg across nodes.
MemoryStats ComputeNodeMemoryStats(std::string label, long local_value, MPI_Comm comm);

// Format a byte count as a human-readable string (e.g., "1.5G", "228.0M").
std::string FormatBytes(double bytes);

// Print memory usage summary.
void PrintMemoryUsage(MPI_Comm comm, const MemoryStats &stats);
