5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -16,13 +16,18 @@ The format of this changelog is based on

#### New Features

- Added a new table reporting memory consumption for various stages in the
simulation. [PR 708](https://github.com/awslabs/palace/pull/708)

- Expanded JSON schema validation to cover required fields, mutual exclusion constraints
(e.g., `PEC`/`Ground`, `PMC`/`ZeroCharge`), array type validation, and numeric bounds.
Many runtime checks are now caught earlier at configuration parsing time with clearer
error messages [PR 635](https://github.com/awslabs/palace/pull/635).

#### Bug Fixes

- Reduced memory usage when `MaxIts` for GMRES is larger than the number of
required iterations. [PR 715](https://github.com/awslabs/palace/pull/715)
- Improved IO performance for simulations with Adaptive Mesh Refinement. Now,
files from previous iterations are moved instead of being copied. The latest
output is always available at the top level of the output directory (as
54 changes: 54 additions & 0 deletions docs/src/developer/notes.md
@@ -217,6 +217,60 @@ Disk IO // < Disk read/write time for loading the mesh f
Total // < Total simulation time
```

## Memory Reporting

Memory reporting in *Palace* tracks **peak** RSS (Resident Set Size) at two granularities:
per-process snapshots and per-phase growth. The goal of this reporting is to understand the
maximum memory required by a *Palace* simulation. We do not track how much memory *Palace*
uses at any given time; instead, we measure how much each phase of the simulation increases
the high-water mark. This is useful for understanding and reducing the total memory required
by *Palace*. Note that this tool cannot comprehensively track the memory lifecycle, because
it does not see allocations and deallocations that do not increase the peak RSS.

### Per-process snapshots

The `memory_reporting` utilities (`memoryreporting.hpp`) provide functions for querying the
current and peak RSS of a process. These are aggregated across MPI ranks (min/max/avg) and
across nodes (by splitting the communicator with `MPI_Comm_split_type`). Two snapshots are
printed during each simulation: current memory after mesh loading, and peak memory after the
solve.
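
For reference, here is a minimal sketch of how a process can query its own peak RSS on
Linux. This illustrates the underlying mechanism only; it is not a copy of the
`memoryreporting.hpp` implementation:

```cpp
#include <sys/resource.h>

// Return the peak (high-water mark) RSS of the calling process in bytes.
// On Linux, getrusage() reports ru_maxrss in kilobytes.
long QueryPeakMemoryBytes()
{
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) != 0)
  {
    return 0;  // Query failed; report zero rather than an arbitrary value.
  }
  return usage.ru_maxrss * 1024L;
}
```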

### Per-phase memory growth

The `Timer`/`BlockTimer` system tracks not only elapsed time but also **peak** RSS growth per
phase. Every `BlockTimer` scope automatically records how much the peak RSS increased during
that phase, using the same stack-based interruption mechanism as timing: entering a nested
scope attributes the memory growth so far to the outer scope, then starts tracking the inner
scope.
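
As an illustrative sketch of this attribution (the phase indices and functions below are
hypothetical, not taken from the *Palace* sources):

```cpp
{
  BlockTimer outer(Timer::CONSTRUCT);
  AssembleOperators();  // Peak-RSS growth here accrues to CONSTRUCT.
  {
    // Entering the nested scope attributes the growth so far to CONSTRUCT,
    // then starts tracking SOLVE from the current peak.
    BlockTimer inner(Timer::SOLVE);
    RunLinearSolve();
  }  // SOLVE's growth is recorded here, and tracking for CONSTRUCT resumes.
}
```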

The per-phase memory table is printed alongside the timing table at the end of each
simulation. It has three columns:

- Per-Node: the maximum growth on any single node during this phase.
- Total: the sum of growth across all nodes during this phase.
- Total HWM (High-Water Mark): the high-water mark growth at the end of this phase, summed
  across all nodes.
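
For illustration only, the table might look like the following (the phase names and values
are invented, not actual *Palace* output):

```
Phase             Per-Node     Total   Total HWM
Initialization      120.0M    480.0M      480.0M
Solve                 1.2G      4.6G        5.1G
Postprocessing       64.0M    256.0M        5.3G
```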

The `BlockTimer` constructor accepts a `count` parameter (default `true`). When `count` is
`false`, both timing and memory tracking are disabled for that scope, which can be used to
exclude a section of code from an enclosing `BlockTimer`'s measurements.
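
For example (a sketch; the phase index and the excluded function are hypothetical):

```cpp
{
  BlockTimer bt(Timer::POSTPRO);
  WriteFields();  // Timed and memory-tracked as part of POSTPRO.
  {
    // count = false: neither the elapsed time nor the peak-RSS growth of
    // this scope is attributed to any phase.
    BlockTimer skip(Timer::POSTPRO, false);
    DumpDebugDiagnostics();
  }
}
```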

Per-phase memory data is also saved to `palace.json` under the `PeakMemoryGrowthMegabytes`
and `PeakNodeMemoryGrowthMegabytes` keys, alongside the existing `ElapsedTime` data.
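
Based on the `SaveMetadata` changes in this PR, the saved metadata has roughly the following
shape (the phase keys and values here are illustrative):

```json
{
  "ElapsedTime": {
    "Durations": { "Init": 1.2, "Solve": 240.7 },
    "Counts": { "Init": 1, "Solve": 1 }
  },
  "PeakMemoryGrowthMegabytes": {
    "Min": { "Init": 85.1, "Solve": 410.8 },
    "Max": { "Init": 90.4, "Solve": 455.2 },
    "Sum": { "Init": 700.9, "Solve": 3480.6 }
  },
  "PeakNodeMemoryGrowthMegabytes": {
    "Min": { "Init": 350.2, "Solve": 1730.4 },
    "Max": { "Init": 352.1, "Solve": 1752.3 },
    "Sum": { "Init": 702.3, "Solve": 3482.7 }
  }
}
```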

### Interpreting memory data

Peak RSS is monotonically non-decreasing (it is the OS high-water mark), so per-phase deltas
are always non-negative. A phase that allocates memory temporarily and then frees it will
still show growth if the allocation pushed the peak. A phase showing zero growth did not
exceed the previously established peak; this does not mean that no allocations were
performed.

Per-phase deltas may not sum exactly to the total because memory growth can occur between
timed phases (e.g., during scope transitions or in code not wrapped by a `BlockTimer`).

You should read each row as "this phase increases the peak memory by this amount".

## Profiling *Palace* on CPUs

A typical *Palace* simulation spends most of its time in libCEED kernels, which, in turn, execute LIBXSMM code on CPUs. LIBXSMM generates code just-in-time to ensure it is as performant as possible for the given architecture and problem. This generated code confuses most profilers. Luckily, [LIBXSMM](https://libxsmm.readthedocs.io/en/latest/libxsmm_prof/) can integrate with the VTune APIs to enable profiling of jitted functions as well.
10 changes: 10 additions & 0 deletions palace/drivers/basesolver.cpp
@@ -181,6 +181,7 @@ void BaseSolver::SolveEstimateMarkRefine(std::vector<std::unique_ptr<Mesh>> &mes
{
// Print timing summary.
Mpi::Print(comm, "\nCumulative timing statistics:\n");
BlockTimer::Finalize(comm);
BlockTimer::Print(comm);
auto peak_mem = memory_reporting::GetPeakMemoryStats(comm);
auto peak_node_mem = memory_reporting::GetPeakNodeMemoryStats(comm);
@@ -300,13 +301,22 @@ void BaseSolver::SaveMetadata(const Timer &timer) const
{
if (root)
{
constexpr double to_mb = 1.0 / (1024.0 * 1024.0);
auto red = BlockTimer::GetReductions();

json meta = LoadMetadata(post_dir);
for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
{
auto key = Timer::descriptions[i];
key.erase(std::remove_if(key.begin(), key.end(), isspace), key.end());
meta["ElapsedTime"]["Durations"][key] = timer.Data((Timer::Index)i);
meta["ElapsedTime"]["Counts"][key] = timer.Counts((Timer::Index)i);
meta["PeakMemoryGrowthMegabytes"]["Min"][key] = red.rank_mem.min[i] * to_mb;
meta["PeakMemoryGrowthMegabytes"]["Max"][key] = red.rank_mem.max[i] * to_mb;
meta["PeakMemoryGrowthMegabytes"]["Sum"][key] = red.rank_mem.sum[i] * to_mb;
meta["PeakNodeMemoryGrowthMegabytes"]["Min"][key] = red.node_mem.min[i] * to_mb;
meta["PeakNodeMemoryGrowthMegabytes"]["Max"][key] = red.node_mem.max[i] * to_mb;
meta["PeakNodeMemoryGrowthMegabytes"]["Sum"][key] = red.node_mem.sum[i] * to_mb;
}
WriteMetadata(post_dir, meta);
}
1 change: 1 addition & 0 deletions palace/main.cpp
@@ -304,6 +304,7 @@ int main(int argc, char *argv[])
Mpi::Print(world_comm, "\n");
memory_reporting::PrintMemoryUsage(world_comm, peak_mem);
memory_reporting::PrintMemoryUsage(world_comm, peak_node_mem);
BlockTimer::Finalize(world_comm);
BlockTimer::Print(world_comm);
solver->SaveMetadata(BlockTimer::GlobalTimer());
solver->SaveMetadata(peak_mem);
16 changes: 8 additions & 8 deletions palace/utils/memoryreporting.cpp
@@ -83,9 +83,6 @@ long GetPeakMemory()
return 0;
}

namespace
{

std::string FormatBytes(double bytes)
{
constexpr double kB = 1024.0;
@@ -107,6 +104,9 @@ std::string FormatBytes(double bytes)
return fmt::format("{:.1f}K", bytes / kB);
}

namespace
{

MemoryStats ComputeStats(std::string label, long local_value, MPI_Comm comm)
{
long val_min = local_value, val_max = local_value, val_sum = local_value;
@@ -116,7 +116,9 @@ MemoryStats ComputeStats(std::string label, long local_value, MPI_Comm comm)
return {label, val_min, val_max, val_sum, static_cast<double>(val_sum) / Mpi::Size(comm)};
}

MemoryStats ComputeNodeStats(std::string label, long local_value, MPI_Comm comm)
} // namespace

MemoryStats ComputeNodeMemoryStats(std::string label, long local_value, MPI_Comm comm)
{
// Split communicator into shared memory groups (processes on same node).
MPI_Comm node_comm;
@@ -165,8 +167,6 @@ MemoryStats ComputeNodeStats(std::string label, long local_value, MPI_Comm comm)
return stats;
}

} // namespace

MemoryStats GetCurrentMemoryStats(MPI_Comm comm)
{
return ComputeStats("current per-rank", GetCurrentMemory(), comm);
@@ -179,12 +179,12 @@ MemoryStats GetPeakMemoryStats(MPI_Comm comm)

MemoryStats GetCurrentNodeMemoryStats(MPI_Comm comm)
{
return ComputeNodeStats("current per-node", GetCurrentMemory(), comm);
return ComputeNodeMemoryStats("current per-node", GetCurrentMemory(), comm);
}

MemoryStats GetPeakNodeMemoryStats(MPI_Comm comm)
{
return ComputeNodeStats("peak per-node", GetPeakMemory(), comm);
return ComputeNodeMemoryStats("peak per-node", GetPeakMemory(), comm);
}

void PrintMemoryUsage(MPI_Comm comm, const MemoryStats &stats)
7 changes: 7 additions & 0 deletions palace/utils/memoryreporting.hpp
@@ -34,6 +34,13 @@ MemoryStats GetPeakMemoryStats(MPI_Comm comm);
MemoryStats GetCurrentNodeMemoryStats(MPI_Comm comm);
MemoryStats GetPeakNodeMemoryStats(MPI_Comm comm);

// Compute per-node statistics for an arbitrary per-rank value (bytes). Sums across
// processes sharing a node and reports min/max/avg across nodes.
MemoryStats ComputeNodeMemoryStats(std::string label, long local_value, MPI_Comm comm);

// Format a byte count as a human-readable string (e.g., "1.5G", "228.0M").
std::string FormatBytes(double bytes);

// Print memory usage summary.
void PrintMemoryUsage(MPI_Comm comm, const MemoryStats &stats);
