Conversation
With the tools I added in PR #708, I am now in a position to look at optimizing how memory is used in Palace. One of the first opportunities I found is the allocation of `H`. Currently, we allocate the full Hessenberg matrix of size `max_dim * max_dim`, even when far fewer iterations are needed. For cases where `max_dim` (= `MaxIts`) is large, this can be very significant. Here, I allocate `H` incrementally as the GMRES iteration proceeds. In my test case, incremental allocation reduced the total memory from 600 GB to 200 GB.
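To illustrate the idea, here is a minimal sketch (not the actual Palace code; the helper name is hypothetical): column `j` of an upper Hessenberg matrix has `j + 2` entries, so the storage can grow with the iterations actually performed.

```cpp
#include <vector>

// Sketch: append one Hessenberg column per GMRES/Arnoldi iteration instead
// of reserving the full max_dim x max_dim block up front.
std::vector<std::vector<double>> H;

void AppendHessenbergColumn(int j)
{
  H.emplace_back(j + 2);  // column j holds entries H(0 .. j + 1, j)
  // ... Arnoldi orthogonalization fills H[j][0 .. j + 1] ...
}
```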
hughcars
left a comment
Useful feature. Some suggestions, mostly minor things, but one is possibly a bit bigger. Given this is for collecting data for decisions, I was trying to think of extra data that might be useful later, so we don't have to change this new table down the line (and your historical samples are richer going forward).
| The `BlockTimer` constructor accepts a `count` parameter (default `true`). When `count` is
| `false`, both timing and memory tracking are disabled for that scope. This is used in tight
| loops (e.g., preconditioner application, coarse solve within iterative solvers) to avoid the
| overhead of `getrusage()` system calls.
If we're really seeing an effect of BlockTimer construction and destruction, that suggests the timer should be hoisted out of whatever loop is in play.
I misunderstood what `count` was for. I thought it was for performance, but it is to separate accounting, as I understand from #147.
I will update the docs.
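For illustration, a minimal sketch of how the `count` flag separates accounting in a tight loop (the phase tags and the call site are assumptions, not the actual Palace code):

```cpp
// The outer solver phase is timed and memory-tracked; the inner per-iteration
// scope passes count = false, which per the quoted docs disables timing and
// memory tracking there, so the cost stays accounted to the enclosing phase.
void SolveWithAccounting(int max_its)
{
  BlockTimer bt(Timer::KSP);
  for (int it = 0; it < max_its; it++)
  {
    BlockTimer bt_pc(Timer::PRECONDITIONER, false);  // count = false
    ApplyPreconditioner();                           // hypothetical helper
  }
}
```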
| Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities:
| per-process snapshots and per-phase growth.
Maybe add a sentence here about how the goal of this reporting is to understand the maximum memory required by a job. I think reporting the high-water mark is very practically useful, but looking at the tables users might get confused and think that, e.g., error estimation doesn't use any memory. Additionally, as developers it doesn't tell us about all wasteful memory usage (which might also affect speed, for example), just the "fattest" section. A sentence to that effect here would be informative for someone skimming. I know it's covered below, but that's the most potentially surprising feature, so it pays to draw attention to it very briefly before the main section.
|
| ### Per-phase memory growth
|
| The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per |
| The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per
| The `Timer`/`BlockTimer` system tracks not only elapsed time but also *peak* RSS growth per
Bold or italicize the "peak"; it's pretty key.
| Peak RSS is monotonically non-decreasing (the OS high-water mark), so per-phase deltas are
| always non-negative. A phase that allocates memory temporarily and then frees it will still
| show the growth if the allocation pushed the peak. A phase showing zero growth means it did
| not exceed the previously established peak.
Is there a metric for the number of allocation requests made? I.e., if the high-water mark never changes, but we request millions of new chunks only to deallocate them straight after, that is probably not great performance-wise. As above though, I think such a thing is likely an extension, not for this PR.
That would be useful, but I suspect it would require more aggressive instrumentation (either continuous polling or injecting code into malloc/new).
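For reference, a hypothetical sketch of the "injecting code into malloc/new" option, counting allocation requests by replacing the global `operator new`/`operator delete` (alignment and array overloads omitted for brevity). This is not part of the PR.

```cpp
#include <atomic>
#include <cstdlib>
#include <new>

std::atomic<std::size_t> alloc_requests{0};  // total operator new calls

void *operator new(std::size_t size)
{
  alloc_requests.fetch_add(1, std::memory_order_relaxed);
  if (void *p = std::malloc(size))
  {
    return p;
  }
  throw std::bad_alloc();
}

void operator delete(void *p) noexcept
{
  std::free(p);
}
```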
| TEST_CASE("BlockTimer Finalize Contract", "[memoryreporting][Serial][Parallel]") | ||
| { | ||
| // Finalize populates the stored reduction results. | ||
| BlockTimer::Finalize(MPI_COMM_WORLD); | ||
| CHECK(BlockTimer::IsFinalized()); | ||
|
|
||
| // Stored vectors have the correct size. | ||
| CHECK(BlockTimer::NodeMemoryMin().size() == Timer::NUM_TIMINGS); | ||
| CHECK(BlockTimer::NodeMemoryMax().size() == Timer::NUM_TIMINGS); | ||
| CHECK(BlockTimer::NodeMemorySum().size() == Timer::NUM_TIMINGS); | ||
|
|
||
| // Per-node values are non-negative (peak RSS deltas can't be negative). | ||
| for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) | ||
| { | ||
| CHECK(BlockTimer::NodeMemoryMin()[i] >= 0.0); | ||
| CHECK(BlockTimer::NodeMemoryMax()[i] >= BlockTimer::NodeMemoryMin()[i]); | ||
| CHECK(BlockTimer::NodeMemorySum()[i] >= BlockTimer::NodeMemoryMax()[i]); | ||
| } | ||
|
|
||
| // Calling Finalize again does not crash (idempotent for AMR loop usage). | ||
| BlockTimer::Finalize(MPI_COMM_WORLD); | ||
| CHECK(BlockTimer::IsFinalized()); | ||
| } |
No `BlockTimer` scope is ever entered in the test body. The static `BlockTimer::timer.mem_data` is zero-initialized (from the `Timer` constructor at timer.hpp:97-101), so all `reduced_*` vectors are filled with zeros. The checks become:

- `NodeMemoryMin()[i] >= 0.0` → `0 >= 0` ✓
- `NodeMemoryMax()[i] >= NodeMemoryMin()[i]` → `0 >= 0` ✓
- `NodeMemorySum()[i] >= NodeMemoryMax()[i]` → `0 >= 0` ✓

A hypothetical `Finalize` that simply calls `resize(NUM_TIMINGS)` on the reduced vectors and returns also passes. A hypothetical `BlockTimer` constructor/destructor that never called `MarkMemory` at all also passes.
One possible suggestion:
TEST_CASE("BlockTimer scopes attribute memory to the correct phase",
"[memoryreporting][Serial]")
{
{
BlockTimer bt(Timer::CONSTRUCT);
std::vector<char> buf(8 * 1024 * 1024, 1);
asm volatile("" :: "g"(buf.data()) : "memory");
}
BlockTimer::Finalize(MPI_COMM_WORLD);
CHECK(BlockTimer::RankMemoryMax()[Timer::CONSTRUCT] > 0);
CHECK(BlockTimer::RankMemoryMax()[Timer::KSP] == 0);
}There was a problem hiding this comment.
I forgot, but I've already run into problems writing unit tests for my other memory reporting feature. The test measures peak memory, but if it runs in a process whose peak was already set by some prior test, the allocation might not push the high-water mark. We also don't know the order in which tests are run.
I think the simplest solution is to remove the test.
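For reference, the quantity that makes the test order-dependent is the process-wide high-water mark from `getrusage()` (mentioned in the quoted docs); a minimal reader looks like this (note the Linux/macOS unit difference for `ru_maxrss`):

```cpp
#include <sys/resource.h>
#include <cstddef>

std::size_t PeakRssBytes()
{
  struct rusage ru;
  getrusage(RUSAGE_SELF, &ru);
#if defined(__APPLE__)
  return static_cast<std::size_t>(ru.ru_maxrss);         // bytes on macOS
#else
  return static_cast<std::size_t>(ru.ru_maxrss) * 1024;  // KiB on Linux
#endif
}
```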
| std::string title =
|     (num_nodes == 1) ? "Peak Memory Growth per Rank" : "Peak Memory Growth per Node";
| PrintTable(comm, title, "Min.", "Max.", "Tot.",
|            [&](int i) -> std::tuple<std::string, std::string, std::string>
|            {
|              return {memory_reporting::FormatBytes(m_min[i]),
|                      memory_reporting::FormatBytes(m_max[i]),
|                      memory_reporting::FormatBytes(m_sum[i])};
|            });
Thinking about this, are min/max/sum the most important here? We're looking at the high-water mark, so the min of a max isn't that informative. The important ones are per-node and total memory. Then I think we could have something interesting if you tracked the peak and the retained, something a bit like:

```
                           Δ Peak RSS   Δ Current RSS
=====================================================
Mesh Preprocessing             +40.9M   +40.9M (kept)
Operator Construction          +33.0M   +28.0M (mostly kept)
Coarse Solve (workspace)      +170.0M    -2.0M (temporary)
Postprocessing                  +0.0K    +0.5M (page-touching)
```

That would tell us if a section allocates and deallocates a big block of memory (maybe we should be reusing that allocation elsewhere?). Not sure if this is possible with the polling approach you have, but if it is, it could be pretty interesting to see.
> Thinking about this, are min/max/sum the most important here?

I added them as a proxy for load balance. Min/max essentially give me information about the root process. Sum here is the total per phase; that's the most important one and the one I look at the most. I agree that they are not as useful in most cases, so I can remove them from the table and keep them in the JSON only.

I thought about using the current RSS as well, but it becomes a little harder to use/interpret because we measure current RSS at a specific point in time (when the `Timer` is destructed). So you could allocate a big block of memory and deallocate it within the same scope, but before we measure current RSS, and that would be invisible to the current RSS measurement.
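A sketch of the blind spot described above (the phase tag and body are hypothetical):

```cpp
#include <vector>

void PhaseWithTemporary()
{
  BlockTimer bt(Timer::KSP);  // illustrative phase tag
  {
    std::vector<double> big(1 << 27, 1.0);  // ~1 GiB, pages touched: pushes peak RSS
  }  // freed here, before bt is destructed
  // A current-RSS sample taken at bt's destruction no longer sees `big`,
  // but the peak RSS delta for this phase still records it.
}
```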
Ah yes, I see. It's almost like we'd want a timer, whilst it is in existence, to poll continuously until its destruction, but that's getting into a more intrusive thing.
| // Stored reduction results (populated by Finalize).
| inline static std::vector<double> reduced_time_min;
| inline static std::vector<double> reduced_time_max;
| inline static std::vector<double> reduced_time_avg;
Possibly these can be organized for a little less copy-paste, though the LOC gain is minimal (unless you want to add more pieces to track):
```cpp
private:
  struct ReducedSeries { std::vector<double> min, max, sum; };
  struct ReducedAvg : ReducedSeries { std::vector<double> avg; };
  inline static ReducedAvg reduced_time;
  inline static ReducedSeries reduced_rank_mem;
  inline static ReducedSeries reduced_node_mem;
  inline static int num_nodes = 0;

public:
  struct Reductions
  {
    const ReducedAvg &time;
    const ReducedSeries &rank_mem;
    const ReducedSeries &node_mem;
    int num_nodes;
  };
  static Reductions GetReductions();  // MFEM_VERIFY(IsFinalized()) once, here
```
`SaveMetadata`'s six separate accessor calls at basesolver.cpp:305-310 collapse to one `const auto r = BlockTimer::GetReductions();`, the Finalize-before-access contract lives at one chokepoint, and the `Timer&` cleanup above can ride in the same diff.
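One possible definition under that suggestion (a sketch; the verify message is illustrative):

```cpp
BlockTimer::Reductions BlockTimer::GetReductions()
{
  // Single chokepoint for the Finalize-before-access contract.
  MFEM_VERIFY(IsFinalized(),
              "BlockTimer::Finalize() must be called before accessing reductions!");
  return {reduced_time, reduced_rank_mem, reduced_node_mem, num_nodes};
}
```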
I expanded the timer system to collect memory information. A new table is printed at the end of each run:

```
                  Peak Memory Growth per Rank
                            Min.        Max.        Tot.
==============================================================
Initialization              3.0M        9.0M        1.1G
Mesh Preprocessing          0.0K       40.9M      391.9M
Operator Construction      12.0M       33.0M        3.1G
Linear Solve                0.0K        4.5M      736.5M
Setup                       1.5M        7.5M     1021.5M
Preconditioner              0.0K        1.5M       31.5M
Coarse Solve               86.7M      170.2M       25.2G
Eigenvalue Solve            0.0K        3.0M      225.5M
Div.-Free Projection        6.0M       79.5M        6.0G
Estimation                  0.0K        0.0K        0.0K
Construction                0.0K        4.5M      312.0M
Solve                       0.0K        0.0K        0.0K
Postprocessing              0.0K        1.5M        4.5M
Paraview                    0.0K        0.0K        0.0K
Grid function               0.0K        0.0K        0.0K
Disk IO                     3.0M       48.0M      621.0M
--------------------------------------------------------------
Total                     244.8M      303.6M       49.4G
```

Each row of this table should be interpreted as "this phase pushed the peak memory by this amount". This information is also added to `palace.json`. I collect both per-rank and per-node information. The above table reports per-rank information when there is only one node; otherwise it reports per-node.
hughcars
left a comment
I think we should have a follow-up to add the within-phase deltas, but there is no need to hold this up.
| Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities:
| per-process snapshots and per-phase growth.
| Memory reporting in *Palace* tracks **peak** RSS (Resident Set Size) at two granularities:
| per-process snapshots and per-phase growth. The goal of this reporting is to understand the
In this, I slightly refactored the `Timer` interface to split the calculations into a `Finalize` function (instead of inside `Print`).