From 0fb6f55d006bc3e933392e4e40083b7162fc1626 Mon Sep 17 00:00:00 2001 From: Gabriele Bozzola Date: Thu, 16 Apr 2026 14:08:55 -0700 Subject: [PATCH 1/7] Add more detailed memory reporting I expanded the timer system to collect memory information. A new table is printed at the end of each run: ``` Peak Memory Growth per Rank Min. Max. Tot. ============================================================== Initialization 3.0M 9.0M 1.1G Mesh Preprocessing 0.0K 40.9M 391.9M Operator Construction 12.0M 33.0M 3.1G Linear Solve 0.0K 4.5M 736.5M Setup 1.5M 7.5M 1021.5M Preconditioner 0.0K 1.5M 31.5M Coarse Solve 86.7M 170.2M 25.2G Eigenvalue Solve 0.0K 3.0M 225.5M Div.-Free Projection 6.0M 79.5M 6.0G Estimation 0.0K 0.0K 0.0K Construction 0.0K 4.5M 312.0M Solve 0.0K 0.0K 0.0K Postprocessing 0.0K 1.5M 4.5M Paraview 0.0K 0.0K 0.0K Grid function 0.0K 0.0K 0.0K Disk IO 3.0M 48.0M 621.0M -------------------------------------------------------------- Total 244.8M 303.6M 49.4G ``` Each row of this table should be interpreted as "this phase pushed the peak memory by this amount". This information is also added to the `palace.json`. I collect both per-rank and per-node information. The above table reports per-rank information when there is only one node, otherwise it reports per-node. --- CHANGELOG.md | 3 + docs/src/developer/notes.md | 46 +++++ palace/drivers/basesolver.cpp | 15 ++ palace/main.cpp | 1 + palace/utils/memoryreporting.cpp | 16 +- palace/utils/memoryreporting.hpp | 7 + palace/utils/timer.hpp | 264 +++++++++++++++++++++++------ test/unit/test-memoryreporting.cpp | 48 ++++++ 8 files changed, 343 insertions(+), 57 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 6006fb4a54..add4118a9e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -16,6 +16,9 @@ The format of this changelog is based on #### New Features + - Added a new table reporting memory consumption for various stages in the + simulation. 
[PR 708](https://github.com/awslabs/palace/pull/708) + - Expanded JSON schema validation to cover required fields, mutual exclusion constraints (e.g., `PEC`/`Ground`, `PMC`/`ZeroCharge`), array type validation, and numeric bounds. Many runtime checks are now caught earlier at configuration parsing time with clearer diff --git a/docs/src/developer/notes.md b/docs/src/developer/notes.md index 496fcc645d..2b05871f6b 100644 --- a/docs/src/developer/notes.md +++ b/docs/src/developer/notes.md @@ -217,6 +217,52 @@ Disk IO // < Disk read/write time for loading the mesh f Total // < Total simulation time ``` +## Memory Reporting + +Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities: +per-process snapshots and per-phase growth. + +### Per-process snapshots + +The `memory_reporting` utilities (`memoryreporting.hpp`) provide functions for querying the +current and peak RSS of a process. These are aggregated across MPI ranks (min/max/avg) and +across nodes (by splitting the communicator with `MPI_Comm_split_type`). Two snapshots are +printed during each simulation: current memory after mesh loading, and peak memory after the +solve. + +### Per-phase memory growth + +The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per +phase. Every `BlockTimer` scope automatically records how much the peak RSS increased during +that phase, using the same stack-based interruption mechanism as timing: entering a nested +scope attributes the memory growth so far to the outer scope, then starts tracking the inner +scope. + +The per-phase memory table is printed alongside the timing table at the end of each +simulation. On a single node, the table shows per-rank statistics with min/max/total (sum +across all ranks). On multiple nodes, it shows per-node statistics (sum of ranks within each +node) with min/max/total across nodes. + +The `BlockTimer` constructor accepts a `count` parameter (default `true`). 
When `count` is
+`false`, both timing and memory tracking are disabled for that scope. This is used in tight
+loops (e.g., preconditioner application, coarse solve within iterative solvers) to avoid the
+overhead of `getrusage()` system calls.
+
+Per-phase memory data is also saved to `palace.json` under the `PeakMemoryGrowthMegabytes`
+and `PeakNodeMemoryGrowthMegabytes` keys, alongside the existing `ElapsedTime` data.
+
+### Interpreting memory data
+
+Peak RSS is monotonically non-decreasing (the OS high-water mark), so per-phase deltas are
+always non-negative. A phase that allocates memory temporarily and then frees it will still
+show the growth if the allocation pushed the peak. A phase showing zero growth means it did
+not exceed the previously established peak.
+
+Per-phase deltas may not sum exactly to the total because memory growth can occur between
+timed phases (e.g., during scope transitions or in code not wrapped by a `BlockTimer`).
+
+You should read each row as "this phase increased the peak memory by this amount".
+
 ## Profiling *Palace* on CPUs
 
 A typical *Palace* simulation spends most of its time in libCEED kernels, which, in turn,
 execute `libxsmm` code on CPUs. libxsmm generates code just-in-time to ensure it is the
 most performant on the given architecture and for the given problem. This code generation
 confuses most profilers. Luckily,
 [libxsmm](https://libxsmm.readthedocs.io/en/latest/libxsmm_prof/) can integrate with the
 VTune APIs to enable profiling of jitted functions as well.
diff --git a/palace/drivers/basesolver.cpp b/palace/drivers/basesolver.cpp
index cbce2f0d2f..c6a9e313d0 100644
--- a/palace/drivers/basesolver.cpp
+++ b/palace/drivers/basesolver.cpp
@@ -181,6 +181,7 @@ void BaseSolver::SolveEstimateMarkRefine(std::vector> &mes
 {
   // Print timing summary.
Mpi::Print(comm, "\nCumulative timing statistics:\n"); + BlockTimer::Finalize(comm); BlockTimer::Print(comm); auto peak_mem = memory_reporting::GetPeakMemoryStats(comm); auto peak_node_mem = memory_reporting::GetPeakNodeMemoryStats(comm); @@ -300,6 +301,14 @@ void BaseSolver::SaveMetadata(const Timer &timer) const { if (root) { + constexpr double to_mb = 1.0 / (1024.0 * 1024.0); + const auto &rank_min = BlockTimer::RankMemoryMin(); + const auto &rank_max = BlockTimer::RankMemoryMax(); + const auto &rank_sum = BlockTimer::RankMemorySum(); + const auto &node_min = BlockTimer::NodeMemoryMin(); + const auto &node_max = BlockTimer::NodeMemoryMax(); + const auto &node_sum = BlockTimer::NodeMemorySum(); + json meta = LoadMetadata(post_dir); for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) { @@ -307,6 +316,12 @@ void BaseSolver::SaveMetadata(const Timer &timer) const key.erase(std::remove_if(key.begin(), key.end(), isspace), key.end()); meta["ElapsedTime"]["Durations"][key] = timer.Data((Timer::Index)i); meta["ElapsedTime"]["Counts"][key] = timer.Counts((Timer::Index)i); + meta["PeakMemoryGrowthMegabytes"]["Min"][key] = rank_min[i] * to_mb; + meta["PeakMemoryGrowthMegabytes"]["Max"][key] = rank_max[i] * to_mb; + meta["PeakMemoryGrowthMegabytes"]["Sum"][key] = rank_sum[i] * to_mb; + meta["PeakNodeMemoryGrowthMegabytes"]["Min"][key] = node_min[i] * to_mb; + meta["PeakNodeMemoryGrowthMegabytes"]["Max"][key] = node_max[i] * to_mb; + meta["PeakNodeMemoryGrowthMegabytes"]["Sum"][key] = node_sum[i] * to_mb; } WriteMetadata(post_dir, meta); } diff --git a/palace/main.cpp b/palace/main.cpp index aba6a70dcd..15d0f7dc94 100644 --- a/palace/main.cpp +++ b/palace/main.cpp @@ -304,6 +304,7 @@ int main(int argc, char *argv[]) Mpi::Print(world_comm, "\n"); memory_reporting::PrintMemoryUsage(world_comm, peak_mem); memory_reporting::PrintMemoryUsage(world_comm, peak_node_mem); + BlockTimer::Finalize(world_comm); BlockTimer::Print(world_comm); 
solver->SaveMetadata(BlockTimer::GlobalTimer()); solver->SaveMetadata(peak_mem); diff --git a/palace/utils/memoryreporting.cpp b/palace/utils/memoryreporting.cpp index b2e70973f9..be29fd0f97 100644 --- a/palace/utils/memoryreporting.cpp +++ b/palace/utils/memoryreporting.cpp @@ -83,9 +83,6 @@ long GetPeakMemory() return 0; } -namespace -{ - std::string FormatBytes(double bytes) { constexpr double kB = 1024.0; @@ -107,6 +104,9 @@ std::string FormatBytes(double bytes) return fmt::format("{:.1f}K", bytes / kB); } +namespace +{ + MemoryStats ComputeStats(std::string label, long local_value, MPI_Comm comm) { long val_min = local_value, val_max = local_value, val_sum = local_value; @@ -116,7 +116,9 @@ MemoryStats ComputeStats(std::string label, long local_value, MPI_Comm comm) return {label, val_min, val_max, val_sum, static_cast(val_sum) / Mpi::Size(comm)}; } -MemoryStats ComputeNodeStats(std::string label, long local_value, MPI_Comm comm) +} // namespace + +MemoryStats ComputeNodeMemoryStats(std::string label, long local_value, MPI_Comm comm) { // Split communicator into shared memory groups (processes on same node). 
MPI_Comm node_comm; @@ -165,8 +167,6 @@ MemoryStats ComputeNodeStats(std::string label, long local_value, MPI_Comm comm) return stats; } -} // namespace - MemoryStats GetCurrentMemoryStats(MPI_Comm comm) { return ComputeStats("current per-rank", GetCurrentMemory(), comm); @@ -179,12 +179,12 @@ MemoryStats GetPeakMemoryStats(MPI_Comm comm) MemoryStats GetCurrentNodeMemoryStats(MPI_Comm comm) { - return ComputeNodeStats("current per-node", GetCurrentMemory(), comm); + return ComputeNodeMemoryStats("current per-node", GetCurrentMemory(), comm); } MemoryStats GetPeakNodeMemoryStats(MPI_Comm comm) { - return ComputeNodeStats("peak per-node", GetPeakMemory(), comm); + return ComputeNodeMemoryStats("peak per-node", GetPeakMemory(), comm); } void PrintMemoryUsage(MPI_Comm comm, const MemoryStats &stats) diff --git a/palace/utils/memoryreporting.hpp b/palace/utils/memoryreporting.hpp index 8ef46e52c2..c6be236bbc 100644 --- a/palace/utils/memoryreporting.hpp +++ b/palace/utils/memoryreporting.hpp @@ -34,6 +34,13 @@ MemoryStats GetPeakMemoryStats(MPI_Comm comm); MemoryStats GetCurrentNodeMemoryStats(MPI_Comm comm); MemoryStats GetPeakNodeMemoryStats(MPI_Comm comm); +// Compute per-node statistics for an arbitrary per-rank value (bytes). Sums across +// processes sharing a node and reports min/max/avg across nodes. +MemoryStats ComputeNodeMemoryStats(std::string label, long local_value, MPI_Comm comm); + +// Format a byte count as a human-readable string (e.g., "1.5G", "228.0M"). +std::string FormatBytes(double bytes); + // Print memory usage summary. void PrintMemoryUsage(MPI_Comm comm, const MemoryStats &stats); diff --git a/palace/utils/timer.hpp b/palace/utils/timer.hpp index 60686cfafa..351ff763a3 100644 --- a/palace/utils/timer.hpp +++ b/palace/utils/timer.hpp @@ -7,14 +7,16 @@ #include #include #include +#include #include #include "utils/communication.hpp" +#include "utils/memoryreporting.hpp" namespace palace { // -// Timer classes for profiling. 
+// Timer classes for profiling time and peak memory growth. // class Timer @@ -86,10 +88,15 @@ class Timer TimePoint last_lap_time; std::vector data; std::vector counts; + long start_memory; + long last_memory; + std::vector mem_data; public: Timer() - : start_time(Now()), last_lap_time(start_time), data(NUM_TIMINGS), counts(NUM_TIMINGS) + : start_time(Now()), last_lap_time(start_time), data(NUM_TIMINGS), counts(NUM_TIMINGS), + start_memory(memory_reporting::GetPeakMemory()), last_memory(start_memory), + mem_data(NUM_TIMINGS, 0) { } @@ -133,6 +140,38 @@ class Timer // Return number of times timer.MarkTime(idx) or TimerBlock b(idx) was called. auto Counts(Index idx) const { return counts[idx]; } + + // Snapshot peak RSS and return delta from last snapshot. + long MemoryLap() + { + long current = memory_reporting::GetPeakMemory(); + long delta = current - last_memory; + last_memory = current; + return delta; + } + + // Return peak RSS growth since timer creation. + long MemoryFromStart() const { return memory_reporting::GetPeakMemory() - start_memory; } + + // Lap and record a memory delta for the given phase. + long MarkMemory(Index idx) { return MarkMemory(idx, MemoryLap()); } + + // Record a given memory delta for the given phase (without lapping). + long MarkMemory(Index idx, long delta) + { + if (idx == Timer::TOTAL) + { + mem_data[idx] = delta; + } + else + { + mem_data[idx] += delta; + } + return mem_data[idx]; + } + + // Provide read-only access to the memory data (bytes) for a given phase. + auto MemoryData(Index idx) const { return mem_data[idx]; } }; class BlockTimer @@ -144,26 +183,45 @@ class BlockTimer inline static std::stack stack; bool count; - // Reduce timing information across MPI ranks. 
- static void Reduce(MPI_Comm comm, std::vector &data_min, - std::vector &data_max, std::vector &data_avg) - { - data_min.resize(Timer::NUM_TIMINGS); - data_max.resize(Timer::NUM_TIMINGS); - data_avg.resize(Timer::NUM_TIMINGS); - for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) - { - data_min[i] = data_max[i] = data_avg[i] = timer.Data((Timer::Index)i); - } - - Mpi::GlobalMin(Timer::NUM_TIMINGS, data_min.data(), comm); - Mpi::GlobalMax(Timer::NUM_TIMINGS, data_max.data(), comm); - Mpi::GlobalSum(Timer::NUM_TIMINGS, data_avg.data(), comm); + // Stored reduction results (populated by Finalize). + inline static std::vector reduced_time_min; + inline static std::vector reduced_time_max; + inline static std::vector reduced_time_avg; + inline static std::vector reduced_mem_min; + inline static std::vector reduced_mem_max; + inline static std::vector reduced_mem_sum; + inline static std::vector reduced_node_mem_min; + inline static std::vector reduced_node_mem_max; + inline static std::vector reduced_node_mem_sum; + inline static int num_nodes = 0; - const int np = Mpi::Size(comm); + // Print a summary table with three columns. The row_fn callback produces the three + // column values for a given timer index. 
+ template + static void PrintTable(MPI_Comm comm, const std::string &title, const std::string &col1, + const std::string &col2, const std::string &col3, RowFn &&row_fn) + { + constexpr int w = 12; // Data column width + constexpr int h = 26; // Left-hand side width + // clang-format off + Mpi::Print(comm, "\n{:<{}s}{:>{}s}{:>{}s}{:>{}s}\n", + title, h, col1, w, col2, w, col3, w); + // clang-format on + Mpi::Print(comm, "{}\n", std::string(h + 3 * w, '=')); for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) { - data_avg[i] /= np; + if (timer.Counts((Timer::Index)i) > 0) + { + if (i == Timer::TOTAL) + { + Mpi::Print(comm, "{}\n", std::string(h + 3 * w, '-')); + } + auto [v1, v2, v3] = row_fn(i); + // clang-format off + Mpi::Print(comm, "{:<{}s}{:>{}s}{:>{}s}{:>{}s}\n", + timer.descriptions[i], h, v1, w, v2, w, v3, w); + // clang-format on + } } } @@ -171,21 +229,29 @@ class BlockTimer BlockTimer(Index i, bool count = true) : count(count) { // Start timing when entering the block, interrupting whatever we were timing before. - // Take note of what we are now timing. if (count) { - stack.empty() ? timer.Lap() : timer.MarkTime(stack.top(), false); + if (stack.empty()) + { + timer.Lap(); + timer.MemoryLap(); + } + else + { + timer.MarkTime(stack.top(), false); + timer.MarkMemory(stack.top(), timer.MemoryLap()); + } stack.push(i); } } ~BlockTimer() { - // When a BlockTimer is no longer in scope, record the time (check whether stack is - // empty in case the timer has already been finalized). + // When a BlockTimer is no longer in scope, record the time and memory growth. if (count && !stack.empty()) { timer.MarkTime(stack.top()); + timer.MarkMemory(stack.top()); stack.pop(); } } @@ -193,44 +259,144 @@ class BlockTimer // Read-only access the static Timer object. static const Timer &GlobalTimer() { return timer; } - // Print timing information after reducing the data across all processes. 
- static void Print(MPI_Comm comm) + // Access stored per-rank memory reduction results (populated by Finalize). + static const std::vector &RankMemoryMin() { + MFEM_VERIFY(IsFinalized(), + "BlockTimer::Finalize() must be called before accessing results!"); + return reduced_mem_min; + } + static const std::vector &RankMemoryMax() + { + MFEM_VERIFY(IsFinalized(), + "BlockTimer::Finalize() must be called before accessing results!"); + return reduced_mem_max; + } + static const std::vector &RankMemorySum() + { + MFEM_VERIFY(IsFinalized(), + "BlockTimer::Finalize() must be called before accessing results!"); + return reduced_mem_sum; + } + + // Access stored per-node memory reduction results (populated by Finalize). + static const std::vector &NodeMemoryMin() + { + MFEM_VERIFY(IsFinalized(), + "BlockTimer::Finalize() must be called before accessing results!"); + return reduced_node_mem_min; + } + static const std::vector &NodeMemoryMax() + { + MFEM_VERIFY(IsFinalized(), + "BlockTimer::Finalize() must be called before accessing results!"); + return reduced_node_mem_max; + } + static const std::vector &NodeMemorySum() + { + MFEM_VERIFY(IsFinalized(), + "BlockTimer::Finalize() must be called before accessing results!"); + return reduced_node_mem_sum; + } + + // Finalize timers and perform MPI reductions. Must be called before Print(). + static void Finalize(MPI_Comm comm) + { + // Drain any open timers. while (!stack.empty()) { timer.MarkTime(stack.top()); + timer.MarkMemory(stack.top()); stack.pop(); } timer.MarkTime(Timer::TOTAL, timer.TimeFromStart()); + timer.MarkMemory(Timer::TOTAL, timer.MemoryFromStart()); - // Reduce timing data. - std::vector data_min, data_max, data_avg; - Reduce(comm, data_min, data_max, data_avg); + // Reduce timing data across ranks. 
+ reduced_time_min.resize(Timer::NUM_TIMINGS); + reduced_time_max.resize(Timer::NUM_TIMINGS); + reduced_time_avg.resize(Timer::NUM_TIMINGS); + for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) + { + reduced_time_min[i] = reduced_time_max[i] = reduced_time_avg[i] = + timer.Data((Timer::Index)i); + } + Mpi::GlobalMin(Timer::NUM_TIMINGS, reduced_time_min.data(), comm); + Mpi::GlobalMax(Timer::NUM_TIMINGS, reduced_time_max.data(), comm); + Mpi::GlobalSum(Timer::NUM_TIMINGS, reduced_time_avg.data(), comm); + const int np = Mpi::Size(comm); + for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) + { + reduced_time_avg[i] /= np; + } - // Print a nice table of the timing data. - constexpr int p = 3; // Floating point precision - constexpr int w = 12; // Data column width - constexpr int h = 26; // Left-hand side width - // clang-format off - Mpi::Print(comm, "\n{:<{}s}{:>{}s}{:>{}s}{:>{}s}\n", - "Elapsed Time Report (s)", h, "Min.", w, "Max.", w, "Avg.", w); - // clang-format on - Mpi::Print(comm, "{}\n", std::string(h + 3 * w, '=')); + // Reduce per-rank memory data across ranks. + reduced_mem_min.resize(Timer::NUM_TIMINGS); + reduced_mem_max.resize(Timer::NUM_TIMINGS); + reduced_mem_sum.resize(Timer::NUM_TIMINGS); for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) { - if (timer.Counts((Timer::Index)i) > 0) - { - if (i == Timer::TOTAL) - { - Mpi::Print(comm, "{}\n", std::string(h + 3 * w, '-')); - } - // clang-format off - Mpi::Print(comm, "{:<{}s}{:{}.{}f}{:{}.{}f}{:{}.{}f}\n", - timer.descriptions[i], h, - data_min[i], w, p, data_max[i], w, p, data_avg[i], w, p); - // clang-format on - } + reduced_mem_min[i] = reduced_mem_max[i] = reduced_mem_sum[i] = + static_cast(timer.MemoryData((Timer::Index)i)); } + Mpi::GlobalMin(Timer::NUM_TIMINGS, reduced_mem_min.data(), comm); + Mpi::GlobalMax(Timer::NUM_TIMINGS, reduced_mem_max.data(), comm); + Mpi::GlobalSum(Timer::NUM_TIMINGS, reduced_mem_sum.data(), comm); + + // Reduce per-node memory data across nodes. 
+ reduced_node_mem_min.resize(Timer::NUM_TIMINGS); + reduced_node_mem_max.resize(Timer::NUM_TIMINGS); + reduced_node_mem_sum.resize(Timer::NUM_TIMINGS); + for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) + { + auto stats = memory_reporting::ComputeNodeMemoryStats( + "", timer.MemoryData((Timer::Index)i), comm); + reduced_node_mem_min[i] = static_cast(stats.min); + reduced_node_mem_max[i] = static_cast(stats.max); + reduced_node_mem_sum[i] = static_cast(stats.sum); + } + + // Count nodes. + MPI_Comm node_comm; + MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm); + int node_rank = Mpi::Rank(node_comm); + MPI_Comm_free(&node_comm); + int is_leader = (node_rank == 0) ? 1 : 0; + num_nodes = 0; + MPI_Allreduce(&is_leader, &num_nodes, 1, MPI_INT, MPI_SUM, comm); + } + + // Whether Finalize has been called. + static bool IsFinalized() { return !reduced_time_min.empty(); } + + // Print timing and memory tables from stored reduction results. + static void Print(MPI_Comm comm) + { + MFEM_VERIFY(IsFinalized(), "BlockTimer::Finalize() must be called before Print()!"); + // Timing table. + constexpr int p = 3; + PrintTable(comm, "Elapsed Time Report (s)", "Min.", "Max.", "Avg.", + [&](int i) -> std::tuple + { + return {fmt::format("{:.{}f}", reduced_time_min[i], p), + fmt::format("{:.{}f}", reduced_time_max[i], p), + fmt::format("{:.{}f}", reduced_time_avg[i], p)}; + }); + + // Memory table. Single node: per-rank min/max/total. Multi-node: per-node + // min/max/total. + const auto &m_min = (num_nodes == 1) ? reduced_mem_min : reduced_node_mem_min; + const auto &m_max = (num_nodes == 1) ? reduced_mem_max : reduced_node_mem_max; + const auto &m_sum = (num_nodes == 1) ? reduced_mem_sum : reduced_node_mem_sum; + std::string title = + (num_nodes == 1) ? 
"Peak Memory Growth per Rank" : "Peak Memory Growth per Node"; + PrintTable(comm, title, "Min.", "Max.", "Tot.", + [&](int i) -> std::tuple + { + return {memory_reporting::FormatBytes(m_min[i]), + memory_reporting::FormatBytes(m_max[i]), + memory_reporting::FormatBytes(m_sum[i])}; + }); } }; diff --git a/test/unit/test-memoryreporting.cpp b/test/unit/test-memoryreporting.cpp index 69ad8fe1e2..85834b604c 100644 --- a/test/unit/test-memoryreporting.cpp +++ b/test/unit/test-memoryreporting.cpp @@ -5,6 +5,7 @@ #include #include "utils/communication.hpp" #include "utils/memoryreporting.hpp" +#include "utils/timer.hpp" using namespace palace; using namespace palace::memory_reporting; @@ -80,3 +81,50 @@ TEST_CASE("Node Memory Stats Multi Process", "[memoryreporting][Parallel]") CHECK(peak_stats.min <= peak_stats.max); CHECK(peak_stats.sum >= peak_stats.min); } + +TEST_CASE("Timer Memory Data Invariants", "[memoryreporting][Serial]") +{ + Timer timer; + + // MemoryLap returns a non-negative delta (peak RSS is non-decreasing). + auto delta = timer.MemoryLap(); + CHECK(delta >= 0); + + // MarkMemory accumulates into the correct index and leaves others at zero. + timer.MarkMemory(Timer::CONSTRUCT, 100); + timer.MarkMemory(Timer::CONSTRUCT, 200); + CHECK(timer.MemoryData(Timer::CONSTRUCT) == 300); + CHECK(timer.MemoryData(Timer::KSP) == 0); + + // TOTAL index assigns rather than accumulates. + timer.MarkMemory(Timer::TOTAL, 500); + timer.MarkMemory(Timer::TOTAL, 700); + CHECK(timer.MemoryData(Timer::TOTAL) == 700); + + // MemoryFromStart is non-negative (peak can only grow). + CHECK(timer.MemoryFromStart() >= 0); +} + +TEST_CASE("BlockTimer Finalize Contract", "[memoryreporting][Serial][Parallel]") +{ + // Finalize populates the stored reduction results. + BlockTimer::Finalize(MPI_COMM_WORLD); + CHECK(BlockTimer::IsFinalized()); + + // Stored vectors have the correct size. 
+  CHECK(BlockTimer::NodeMemoryMin().size() == Timer::NUM_TIMINGS);
+  CHECK(BlockTimer::NodeMemoryMax().size() == Timer::NUM_TIMINGS);
+  CHECK(BlockTimer::NodeMemorySum().size() == Timer::NUM_TIMINGS);
+
+  // Per-node values are non-negative (peak RSS deltas can't be negative).
+  for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
+  {
+    CHECK(BlockTimer::NodeMemoryMin()[i] >= 0.0);
+    CHECK(BlockTimer::NodeMemoryMax()[i] >= BlockTimer::NodeMemoryMin()[i]);
+    CHECK(BlockTimer::NodeMemorySum()[i] >= BlockTimer::NodeMemoryMax()[i]);
+  }
+
+  // Calling Finalize again does not crash (idempotent for AMR loop usage).
+  BlockTimer::Finalize(MPI_COMM_WORLD);
+  CHECK(BlockTimer::IsFinalized());
+}

From 9083d26b2848fc9e0cc2c13ca333c5b54775b6dc Mon Sep 17 00:00:00 2001
From: Gabriele Bozzola
Date: Wed, 22 Apr 2026 10:10:50 -0700
Subject: [PATCH 2/7] Address review

---
 docs/src/developer/notes.md        | 20 ++--
 palace/drivers/basesolver.cpp      | 19 ++--
 palace/utils/timer.hpp             | 147 +++++++++++++----------------
 test/unit/test-memoryreporting.cpp | 28 ++----
 4 files changed, 93 insertions(+), 121 deletions(-)

diff --git a/docs/src/developer/notes.md b/docs/src/developer/notes.md
index 2b05871f6b..7248212768 100644
--- a/docs/src/developer/notes.md
+++ b/docs/src/developer/notes.md
@@ -219,8 +219,14 @@ Total // < Total simulation time
 
 ## Memory Reporting
 
-Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities:
-per-process snapshots and per-phase growth.
+Memory reporting in *Palace* tracks **peak** RSS (Resident Set Size) at two granularities:
+per-process snapshots and per-phase growth. The goal of this reporting is to understand the
+maximum memory required by a *Palace* simulation. We do not track how much memory *Palace*
+uses at any given time; instead, we measure how much various phases of the simulation
+increase the high-water mark. This is useful to understand and reduce the total memory
+required by *Palace*. 
Note that this tool is not sufficient to comprehensively track the memory
+lifecycle, because it does not see allocations/deallocations that do not increase the peak
+RSS.
 
 ### Per-process snapshots
 
@@ -232,7 +238,7 @@ solve.
 
 ### Per-phase memory growth
 
-The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per
+The `Timer`/`BlockTimer` system tracks not only elapsed time but also **peak** RSS growth per
 phase. Every `BlockTimer` scope automatically records how much the peak RSS increased during
 that phase, using the same stack-based interruption mechanism as timing: entering a nested
 scope attributes the memory growth so far to the outer scope, then starts tracking the inner
@@ -244,9 +250,8 @@ across all ranks). On multiple nodes, it shows per-node statistics (sum of ranks
 node) with min/max/total across nodes.
 
 The `BlockTimer` constructor accepts a `count` parameter (default `true`). When `count` is
-`false`, both timing and memory tracking are disabled for that scope. This is used in tight
-loops (e.g., preconditioner application, coarse solve within iterative solvers) to avoid the
-overhead of `getrusage()` system calls.
+`false`, both timing and memory tracking are disabled for that scope. This can be used to
+exclude certain sections of code from a `BlockTimer`.
 
 Per-phase memory data is also saved to `palace.json` under the `PeakMemoryGrowthMegabytes`
 and `PeakNodeMemoryGrowthMegabytes` keys, alongside the existing `ElapsedTime` data.
 
@@ -256,7 +261,8 @@ and `PeakNodeMemoryGrowthMegabytes` keys, alongside the existing `ElapsedTime` d
 Peak RSS is monotonically non-decreasing (the OS high-water mark), so per-phase deltas are
 always non-negative. A phase that allocates memory temporarily and then frees it will still
 show the growth if the allocation pushed the peak. A phase showing zero growth means it did
 not exceed the previously established peak. 
Note that this does not mean that no allocations +were performed. Per-phase deltas may not sum exactly to the total because memory growth can occur between timed phases (e.g., during scope transitions or in code not wrapped by a `BlockTimer`). diff --git a/palace/drivers/basesolver.cpp b/palace/drivers/basesolver.cpp index c6a9e313d0..2b927acfd7 100644 --- a/palace/drivers/basesolver.cpp +++ b/palace/drivers/basesolver.cpp @@ -302,12 +302,7 @@ void BaseSolver::SaveMetadata(const Timer &timer) const if (root) { constexpr double to_mb = 1.0 / (1024.0 * 1024.0); - const auto &rank_min = BlockTimer::RankMemoryMin(); - const auto &rank_max = BlockTimer::RankMemoryMax(); - const auto &rank_sum = BlockTimer::RankMemorySum(); - const auto &node_min = BlockTimer::NodeMemoryMin(); - const auto &node_max = BlockTimer::NodeMemoryMax(); - const auto &node_sum = BlockTimer::NodeMemorySum(); + auto red = BlockTimer::GetReductions(); json meta = LoadMetadata(post_dir); for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) @@ -316,12 +311,12 @@ void BaseSolver::SaveMetadata(const Timer &timer) const key.erase(std::remove_if(key.begin(), key.end(), isspace), key.end()); meta["ElapsedTime"]["Durations"][key] = timer.Data((Timer::Index)i); meta["ElapsedTime"]["Counts"][key] = timer.Counts((Timer::Index)i); - meta["PeakMemoryGrowthMegabytes"]["Min"][key] = rank_min[i] * to_mb; - meta["PeakMemoryGrowthMegabytes"]["Max"][key] = rank_max[i] * to_mb; - meta["PeakMemoryGrowthMegabytes"]["Sum"][key] = rank_sum[i] * to_mb; - meta["PeakNodeMemoryGrowthMegabytes"]["Min"][key] = node_min[i] * to_mb; - meta["PeakNodeMemoryGrowthMegabytes"]["Max"][key] = node_max[i] * to_mb; - meta["PeakNodeMemoryGrowthMegabytes"]["Sum"][key] = node_sum[i] * to_mb; + meta["PeakMemoryGrowthMegabytes"]["Min"][key] = red.rank_mem.min[i] * to_mb; + meta["PeakMemoryGrowthMegabytes"]["Max"][key] = red.rank_mem.max[i] * to_mb; + meta["PeakMemoryGrowthMegabytes"]["Sum"][key] = red.rank_mem.sum[i] * to_mb; + 
meta["PeakNodeMemoryGrowthMegabytes"]["Min"][key] = red.node_mem.min[i] * to_mb; + meta["PeakNodeMemoryGrowthMegabytes"]["Max"][key] = red.node_mem.max[i] * to_mb; + meta["PeakNodeMemoryGrowthMegabytes"]["Sum"][key] = red.node_mem.sum[i] * to_mb; } WriteMetadata(post_dir, meta); } diff --git a/palace/utils/timer.hpp b/palace/utils/timer.hpp index 351ff763a3..9e893cb5a4 100644 --- a/palace/utils/timer.hpp +++ b/palace/utils/timer.hpp @@ -184,15 +184,22 @@ class BlockTimer bool count; // Stored reduction results (populated by Finalize). - inline static std::vector reduced_time_min; - inline static std::vector reduced_time_max; - inline static std::vector reduced_time_avg; - inline static std::vector reduced_mem_min; - inline static std::vector reduced_mem_max; - inline static std::vector reduced_mem_sum; - inline static std::vector reduced_node_mem_min; - inline static std::vector reduced_node_mem_max; - inline static std::vector reduced_node_mem_sum; + struct ReducedData + { + std::vector min, max, sum, avg; + void resize(int n) + { + min.resize(n); + max.resize(n); + sum.resize(n); + avg.resize(n); + } + bool empty() const { return min.empty(); } + }; + + inline static ReducedData reduced_time; + inline static ReducedData reduced_rank_mem; + inline static ReducedData reduced_node_mem; inline static int num_nodes = 0; // Print a summary table with three columns. The row_fn callback produces the three @@ -259,44 +266,19 @@ class BlockTimer // Read-only access the static Timer object. static const Timer &GlobalTimer() { return timer; } - // Access stored per-rank memory reduction results (populated by Finalize). - static const std::vector &RankMemoryMin() - { - MFEM_VERIFY(IsFinalized(), - "BlockTimer::Finalize() must be called before accessing results!"); - return reduced_mem_min; - } - static const std::vector &RankMemoryMax() + // Access all stored reduction results (populated by Finalize). 
+ struct Reductions { - MFEM_VERIFY(IsFinalized(), - "BlockTimer::Finalize() must be called before accessing results!"); - return reduced_mem_max; - } - static const std::vector &RankMemorySum() - { - MFEM_VERIFY(IsFinalized(), - "BlockTimer::Finalize() must be called before accessing results!"); - return reduced_mem_sum; - } - - // Access stored per-node memory reduction results (populated by Finalize). - static const std::vector &NodeMemoryMin() - { - MFEM_VERIFY(IsFinalized(), - "BlockTimer::Finalize() must be called before accessing results!"); - return reduced_node_mem_min; - } - static const std::vector &NodeMemoryMax() - { - MFEM_VERIFY(IsFinalized(), - "BlockTimer::Finalize() must be called before accessing results!"); - return reduced_node_mem_max; - } - static const std::vector &NodeMemorySum() + const ReducedData &time; + const ReducedData &rank_mem; + const ReducedData &node_mem; + int num_nodes; + }; + static Reductions GetReductions() { MFEM_VERIFY(IsFinalized(), "BlockTimer::Finalize() must be called before accessing results!"); - return reduced_node_mem_sum; + return {reduced_time, reduced_rank_mem, reduced_node_mem, num_nodes}; } // Finalize timers and perform MPI reductions. Must be called before Print(). @@ -312,48 +294,44 @@ class BlockTimer timer.MarkTime(Timer::TOTAL, timer.TimeFromStart()); timer.MarkMemory(Timer::TOTAL, timer.MemoryFromStart()); + const int n = Timer::NUM_TIMINGS; + const int np = Mpi::Size(comm); + // Reduce timing data across ranks. 
-    reduced_time_min.resize(Timer::NUM_TIMINGS);
-    reduced_time_max.resize(Timer::NUM_TIMINGS);
-    reduced_time_avg.resize(Timer::NUM_TIMINGS);
-    for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
+    reduced_time.resize(n);
+    for (int i = Timer::INIT; i < n; i++)
     {
-      reduced_time_min[i] = reduced_time_max[i] = reduced_time_avg[i] =
+      reduced_time.min[i] = reduced_time.max[i] = reduced_time.avg[i] =
           timer.Data((Timer::Index)i);
     }
-    Mpi::GlobalMin(Timer::NUM_TIMINGS, reduced_time_min.data(), comm);
-    Mpi::GlobalMax(Timer::NUM_TIMINGS, reduced_time_max.data(), comm);
-    Mpi::GlobalSum(Timer::NUM_TIMINGS, reduced_time_avg.data(), comm);
-    const int np = Mpi::Size(comm);
-    for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
+    Mpi::GlobalMin(n, reduced_time.min.data(), comm);
+    Mpi::GlobalMax(n, reduced_time.max.data(), comm);
+    Mpi::GlobalSum(n, reduced_time.avg.data(), comm);
+    for (int i = Timer::INIT; i < n; i++)
     {
-      reduced_time_avg[i] /= np;
+      reduced_time.avg[i] /= np;
     }
 
     // Reduce per-rank memory data across ranks.
-    reduced_mem_min.resize(Timer::NUM_TIMINGS);
-    reduced_mem_max.resize(Timer::NUM_TIMINGS);
-    reduced_mem_sum.resize(Timer::NUM_TIMINGS);
-    for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
+    reduced_rank_mem.resize(n);
+    for (int i = Timer::INIT; i < n; i++)
     {
-      reduced_mem_min[i] = reduced_mem_max[i] = reduced_mem_sum[i] =
+      reduced_rank_mem.min[i] = reduced_rank_mem.max[i] = reduced_rank_mem.sum[i] =
          static_cast<double>(timer.MemoryData((Timer::Index)i));
     }
-    Mpi::GlobalMin(Timer::NUM_TIMINGS, reduced_mem_min.data(), comm);
-    Mpi::GlobalMax(Timer::NUM_TIMINGS, reduced_mem_max.data(), comm);
-    Mpi::GlobalSum(Timer::NUM_TIMINGS, reduced_mem_sum.data(), comm);
+    Mpi::GlobalMin(n, reduced_rank_mem.min.data(), comm);
+    Mpi::GlobalMax(n, reduced_rank_mem.max.data(), comm);
+    Mpi::GlobalSum(n, reduced_rank_mem.sum.data(), comm);
 
     // Reduce per-node memory data across nodes.
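+    // Example: with two nodes of two ranks each and per-rank peak growth {1.0, 2.0} MB
+    // on node 0 and {3.0, 4.0} MB on node 1, the per-node sums are {3.0, 7.0} MB, so
+    // across nodes min = 3.0, max = 7.0, and sum = 10.0 MB.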
-    reduced_node_mem_min.resize(Timer::NUM_TIMINGS);
-    reduced_node_mem_max.resize(Timer::NUM_TIMINGS);
-    reduced_node_mem_sum.resize(Timer::NUM_TIMINGS);
-    for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
+    reduced_node_mem.resize(n);
+    for (int i = Timer::INIT; i < n; i++)
     {
       auto stats = memory_reporting::ComputeNodeMemoryStats(
           "", timer.MemoryData((Timer::Index)i), comm);
-      reduced_node_mem_min[i] = static_cast<double>(stats.min);
-      reduced_node_mem_max[i] = static_cast<double>(stats.max);
-      reduced_node_mem_sum[i] = static_cast<double>(stats.sum);
+      reduced_node_mem.min[i] = static_cast<double>(stats.min);
+      reduced_node_mem.max[i] = static_cast<double>(stats.max);
+      reduced_node_mem.sum[i] = static_cast<double>(stats.sum);
     }
 
     // Count nodes.
@@ -367,7 +345,7 @@ class BlockTimer
   }
 
   // Whether Finalize has been called.
-  static bool IsFinalized() { return !reduced_time_min.empty(); }
+  static bool IsFinalized() { return !reduced_time.empty(); }
 
   // Print timing and memory tables from stored reduction results.
   static void Print(MPI_Comm comm)
@@ -378,24 +356,25 @@ class BlockTimer
     PrintTable(comm, "Elapsed Time Report (s)", "Min.", "Max.", "Avg.",
                [&](int i) -> std::tuple<std::string, std::string, std::string>
                {
-                 return {fmt::format("{:.{}f}", reduced_time_min[i], p),
-                         fmt::format("{:.{}f}", reduced_time_max[i], p),
-                         fmt::format("{:.{}f}", reduced_time_avg[i], p)};
+                 return {fmt::format("{:.{}f}", reduced_time.min[i], p),
+                         fmt::format("{:.{}f}", reduced_time.max[i], p),
+                         fmt::format("{:.{}f}", reduced_time.avg[i], p)};
               });
 
-    // Memory table. Single node: per-rank min/max/total. Multi-node: per-node
-    // min/max/total.
-    const auto &m_min = (num_nodes == 1) ? reduced_mem_min : reduced_node_mem_min;
-    const auto &m_max = (num_nodes == 1) ? reduced_mem_max : reduced_node_mem_max;
-    const auto &m_sum = (num_nodes == 1) ? reduced_mem_sum : reduced_node_mem_sum;
-    std::string title =
-        (num_nodes == 1) ?
"Peak Memory Growth per Rank" : "Peak Memory Growth per Node"; - PrintTable(comm, title, "Min.", "Max.", "Tot.", + // Memory table: per-node max, total across nodes, and cumulative total HWM. + const auto &nm = reduced_node_mem; + double hwm = 0.0; + PrintTable(comm, "Peak Memory", "Per-Node", "Total", "Total HWM", [&](int i) -> std::tuple { - return {memory_reporting::FormatBytes(m_min[i]), - memory_reporting::FormatBytes(m_max[i]), - memory_reporting::FormatBytes(m_sum[i])}; + if (i != Timer::TOTAL) + { + hwm += nm.sum[i]; + } + return {memory_reporting::FormatBytes(nm.max[i]), + memory_reporting::FormatBytes(nm.sum[i]), + (i == Timer::TOTAL) ? memory_reporting::FormatBytes(nm.sum[i]) + : memory_reporting::FormatBytes(hwm)}; }); } }; diff --git a/test/unit/test-memoryreporting.cpp b/test/unit/test-memoryreporting.cpp index 85834b604c..4f632a294c 100644 --- a/test/unit/test-memoryreporting.cpp +++ b/test/unit/test-memoryreporting.cpp @@ -105,26 +105,18 @@ TEST_CASE("Timer Memory Data Invariants", "[memoryreporting][Serial]") CHECK(timer.MemoryFromStart() >= 0); } -TEST_CASE("BlockTimer Finalize Contract", "[memoryreporting][Serial][Parallel]") +// BlockTimer uses inline statics, so all BlockTimer tests must live in a single TEST_CASE +// (or run last) to avoid polluting state for other tests in the same process. +TEST_CASE("BlockTimer scopes attribute memory to the correct phase", + "[memoryreporting][Serial]") { - // Finalize populates the stored reduction results. - BlockTimer::Finalize(MPI_COMM_WORLD); - CHECK(BlockTimer::IsFinalized()); - - // Stored vectors have the correct size. - CHECK(BlockTimer::NodeMemoryMin().size() == Timer::NUM_TIMINGS); - CHECK(BlockTimer::NodeMemoryMax().size() == Timer::NUM_TIMINGS); - CHECK(BlockTimer::NodeMemorySum().size() == Timer::NUM_TIMINGS); - - // Per-node values are non-negative (peak RSS deltas can't be negative). 
-  for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
   {
-    CHECK(BlockTimer::NodeMemoryMin()[i] >= 0.0);
-    CHECK(BlockTimer::NodeMemoryMax()[i] >= BlockTimer::NodeMemoryMin()[i]);
-    CHECK(BlockTimer::NodeMemorySum()[i] >= BlockTimer::NodeMemoryMax()[i]);
+    BlockTimer bt(Timer::CONSTRUCT);
+    // volatile prevents the compiler from optimizing away the allocation.
+    volatile std::vector<char> buf(64 * 1024 * 1024, 1);
   }
-
-  // Calling Finalize again does not crash (idempotent for AMR loop usage).
   BlockTimer::Finalize(MPI_COMM_WORLD);
-  CHECK(BlockTimer::IsFinalized());
+  auto red = BlockTimer::GetReductions();
+  CHECK(red.rank_mem.max[Timer::CONSTRUCT] > 0);
+  CHECK(red.rank_mem.max[Timer::KSP] == 0);
 }

From 842321533e7382bfacf6451e86185854ea399298 Mon Sep 17 00:00:00 2001
From: Gabriele Bozzola
Date: Wed, 22 Apr 2026 10:19:24 -0700
Subject: [PATCH 3/7] Update docs

---
 docs/src/developer/notes.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/src/developer/notes.md b/docs/src/developer/notes.md
index 7248212768..9209aeea05 100644
--- a/docs/src/developer/notes.md
+++ b/docs/src/developer/notes.md
@@ -245,9 +245,11 @@ scope attributes the memory growth so far to the outer scope, then starts tracki
 scope.
 
 The per-phase memory table is printed alongside the timing table at the end of each
-simulation. On a single node, the table shows per-rank statistics with min/max/total (sum
-across all ranks). On multiple nodes, it shows per-node statistics (sum of ranks within each
-node) with min/max/total across nodes.
+simulation. It has three columns:
+
+  - Per-Node: maximum growth on any single node for this phase.
+  - Total: sum of growth across all nodes for this phase.
+  - Total HWM (High Water Mark): high-water-mark growth at the end of this phase, summed across all nodes.
 
 The `BlockTimer` constructor accepts a `count` parameter (default `true`). When `count` is
 `false`, both timing and memory tracking are disabled for that scope.
This can be used to

From 01da1b4cbc5a25a27e2a1778762beb7a50a906f7 Mon Sep 17 00:00:00 2001
From: Gabriele Bozzola
Date: Wed, 22 Apr 2026 10:20:42 -0700
Subject: [PATCH 4/7] Fix test

---
 test/unit/test-memoryreporting.cpp | 16 +---------------
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/test/unit/test-memoryreporting.cpp b/test/unit/test-memoryreporting.cpp
index 4f632a294c..599f27f1b4 100644
--- a/test/unit/test-memoryreporting.cpp
+++ b/test/unit/test-memoryreporting.cpp
@@ -105,18 +105,4 @@ TEST_CASE("Timer Memory Data Invariants", "[memoryreporting][Serial]")
   CHECK(timer.MemoryFromStart() >= 0);
 }
 
-// BlockTimer uses inline statics, so all BlockTimer tests must live in a single TEST_CASE
-// (or run last) to avoid polluting state for other tests in the same process.
-TEST_CASE("BlockTimer scopes attribute memory to the correct phase",
-          "[memoryreporting][Serial]")
-{
-  {
-    BlockTimer bt(Timer::CONSTRUCT);
-    // volatile prevents the compiler from optimizing away the allocation.
-    volatile std::vector<char> buf(64 * 1024 * 1024, 1);
-  }
-  BlockTimer::Finalize(MPI_COMM_WORLD);
-  auto red = BlockTimer::GetReductions();
-  CHECK(red.rank_mem.max[Timer::CONSTRUCT] > 0);
-  CHECK(red.rank_mem.max[Timer::KSP] == 0);
-}
+

From 2e8c80983340348174c76808fda6ba0b7d3be660 Mon Sep 17 00:00:00 2001
From: Gabriele Bozzola
Date: Wed, 22 Apr 2026 10:34:58 -0700
Subject: [PATCH 5/7] Add changelog for 715

---
 CHANGELOG.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index add4118a9e..a6b1929f4c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -26,6 +26,8 @@ The format of this changelog is based on
 
 #### Bug Fixes
 
+  - Reduced memory usage when `MaxIts` for GMRES is larger than the number of
+    required iterations. [PR 715](https://github.com/awslabs/palace/pull/715)
   - Improved IO performance for simulations with Adaptive Mesh Refinement. Now,
     files from previous iterations are moved instead of being copied.
     The latest output is always available at the top level of the output directory (as

From 3da9576045e5355de5808a2892a90428339a963a Mon Sep 17 00:00:00 2001
From: Gabriele Bozzola
Date: Wed, 22 Apr 2026 10:39:23 -0700
Subject: [PATCH 6/7] fix style

---
 test/unit/test-memoryreporting.cpp | 2 --
 1 file changed, 2 deletions(-)

diff --git a/test/unit/test-memoryreporting.cpp b/test/unit/test-memoryreporting.cpp
index 599f27f1b4..0df94cfd95 100644
--- a/test/unit/test-memoryreporting.cpp
+++ b/test/unit/test-memoryreporting.cpp
@@ -104,5 +104,3 @@ TEST_CASE("Timer Memory Data Invariants", "[memoryreporting][Serial]")
   // MemoryFromStart is non-negative (peak can only grow).
   CHECK(timer.MemoryFromStart() >= 0);
 }
-
-

From 175022074ab51cc2de258f20ac3ddfbc91fb56ac Mon Sep 17 00:00:00 2001
From: Gabriele Bozzola
Date: Wed, 22 Apr 2026 12:53:48 -0700
Subject: [PATCH 7/7] fix nit

---
 docs/src/developer/notes.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/developer/notes.md b/docs/src/developer/notes.md
index 9209aeea05..afb04cd4ec 100644
--- a/docs/src/developer/notes.md
+++ b/docs/src/developer/notes.md
@@ -220,7 +220,7 @@ Total // < Total simulation time
 ## Memory Reporting
 
 Memory reporting in *Palace* tracks **peak** RSS (Resident Set Size) at two granularities:
-per-process snapshots and per-phase growth. Goal of this reporting is to understand the
+per-process snapshots and per-phase growth. The goal of this reporting is to understand the
 maximum memory required by a *Palace* simulation. We do not track how much memory *Palace*
 uses at any given time; instead, we measure how much various phases of the simulation
 increase the high-water mark. This is useful to understand and reduce the total memory