
Add more detailed memory reporting#708

Merged
Sbozzolo merged 7 commits into main from gbozzola/more_memory_reporting on Apr 22, 2026
Conversation

Member

@Sbozzolo Sbozzolo commented Apr 16, 2026

I expanded the timer system to collect memory information.

A new table is printed at the end of each run:

```
Peak Memory Growth per Rank        Min.        Max.        Tot.
==============================================================
Initialization                    3.0M        9.0M        1.1G
  Mesh Preprocessing              0.0K       40.9M      391.9M
Operator Construction            12.0M       33.0M        3.1G
Linear Solve                      0.0K        4.5M      736.5M
  Setup                           1.5M        7.5M     1021.5M
  Preconditioner                  0.0K        1.5M       31.5M
  Coarse Solve                   86.7M      170.2M       25.2G
Eigenvalue Solve                  0.0K        3.0M      225.5M
Div.-Free Projection              6.0M       79.5M        6.0G
Estimation                        0.0K        0.0K        0.0K
  Construction                    0.0K        4.5M      312.0M
  Solve                           0.0K        0.0K        0.0K
Postprocessing                    0.0K        1.5M        4.5M
  Paraview                        0.0K        0.0K        0.0K
  Grid function                   0.0K        0.0K        0.0K
Disk IO                           3.0M       48.0M      621.0M
--------------------------------------------------------------
Total                           244.8M      303.6M       49.4G
```

Each row of this table should be interpreted as "this phase pushed the peak memory by this amount".

This information is also added to the `palace.json`.

I collect both per-rank and per-node information. The above table reports per-rank information when there is only one node, otherwise it reports per-node.

In doing this, I slightly refactored the `Timer` interface to split the reduction calculations into a `Finalize` function (instead of performing them inside `Print`).

@Sbozzolo Sbozzolo added enhancement New feature or request no-long-tests This PR does not require the long tests to be merged labels Apr 16, 2026
@Sbozzolo Sbozzolo force-pushed the gbozzola/more_memory_reporting branch 4 times, most recently from 29c4a26 to 20ac288 Compare April 20, 2026 16:25
@Sbozzolo Sbozzolo self-assigned this Apr 20, 2026
@Sbozzolo Sbozzolo force-pushed the gbozzola/more_memory_reporting branch from 20ac288 to b2f3ca5 Compare April 21, 2026 15:19
@Sbozzolo Sbozzolo marked this pull request as ready for review April 21, 2026 15:19
@Sbozzolo Sbozzolo requested a review from hughcars April 21, 2026 15:19
hughcars pushed a commit that referenced this pull request Apr 22, 2026
With the tools I added in PR #708, I am now in the position where I can
look at optimizing how memory is used in Palace.

One of the first places I found is with the allocation of H.

Currently, we allocate the full Hessenberg matrix with size max_dim *
max_dim, even when fewer iterations are required. For cases where
max_dim (= MaxIts) is large, this can be very significant.

Here, I allocate H incrementally as GMRES iterates. In my test case,
incremental allocation reduced the total memory from 600 GB to 200 GB.
@Sbozzolo Sbozzolo force-pushed the gbozzola/more_memory_reporting branch from b2f3ca5 to ffe3e60 Compare April 22, 2026 15:13
Collaborator

@hughcars hughcars left a comment


Useful feature. Some suggestions, mostly minor things, but one possibly a bit bigger. Given this is for collecting data for decisions, I was trying to think of some extra data that might be useful in future, so we don't have to change this new table later (and your historical samples are richer going forward).

Comment thread docs/src/developer/notes.md Outdated
The `BlockTimer` constructor accepts a `count` parameter (default `true`). When `count` is
`false`, both timing and memory tracking are disabled for that scope. This is used in tight
loops (e.g., preconditioner application, coarse solve within iterative solvers) to avoid the
overhead of `getrusage()` system calls.
Collaborator


If we're really seeing an effect of BlockTimer construction and destruction, that suggests the timer should be hoisted out of whatever loop is in play.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood what `count` was for. I thought it was for performance, but it is for separating accounting, as I understand from #147.

I will update the docs.

Comment thread palace/utils/timer.hpp
Comment thread docs/src/developer/notes.md Outdated
Comment on lines +222 to +223
Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities:
per-process snapshots and per-phase growth.
Collaborator


Maybe add a sentence here about how the goal of this reporting is to understand the maximum memory required by a job. I think reporting the high-water mark is very practically useful, but looking at the tables users might get confused and think that error estimation etc. doesn't use any memory. Additionally, as developers it doesn't tell us about all wasteful memory usage (which might also affect speed, for example), just the "fattest section". A sentence to that effect here would be informative for someone skimming. I know it's covered below, but that's the most potentially surprising feature, so it pays to draw attention to it very briefly before the main section.

Comment thread docs/src/developer/notes.md Outdated

### Per-phase memory growth

The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per
Collaborator


Suggested change
The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per
The `Timer`/`BlockTimer` system tracks not only elapsed time but also *peak* RSS growth per

Bold or italicize the "peak"; it's pretty key.

Comment thread docs/src/developer/notes.md
Comment thread docs/src/developer/notes.md Outdated
Comment on lines +256 to +259
Peak RSS is monotonically non-decreasing (the OS high-water mark), so per-phase deltas are
always non-negative. A phase that allocates memory temporarily and then frees it will still
show the growth if the allocation pushed the peak. A phase showing zero growth means it did
not exceed the previously established peak.
Collaborator


Is there a metric for the number of allocation requests made? I.e., if the high water never changes, but we request millions of new chunks only to deallocate straight after, that is probably not great performance-wise. As above though, I think such a thing is likely an extension, not for this PR.

Member Author


That would be useful, but I suspect it would require more aggressive instrumentation (either continuous polling or injecting code into `malloc`/`new`).

Comment thread test/unit/test-memoryreporting.cpp Outdated
Comment on lines +108 to +130
```cpp
TEST_CASE("BlockTimer Finalize Contract", "[memoryreporting][Serial][Parallel]")
{
  // Finalize populates the stored reduction results.
  BlockTimer::Finalize(MPI_COMM_WORLD);
  CHECK(BlockTimer::IsFinalized());

  // Stored vectors have the correct size.
  CHECK(BlockTimer::NodeMemoryMin().size() == Timer::NUM_TIMINGS);
  CHECK(BlockTimer::NodeMemoryMax().size() == Timer::NUM_TIMINGS);
  CHECK(BlockTimer::NodeMemorySum().size() == Timer::NUM_TIMINGS);

  // Per-node values are non-negative (peak RSS deltas can't be negative).
  for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++)
  {
    CHECK(BlockTimer::NodeMemoryMin()[i] >= 0.0);
    CHECK(BlockTimer::NodeMemoryMax()[i] >= BlockTimer::NodeMemoryMin()[i]);
    CHECK(BlockTimer::NodeMemorySum()[i] >= BlockTimer::NodeMemoryMax()[i]);
  }

  // Calling Finalize again does not crash (idempotent for AMR loop usage).
  BlockTimer::Finalize(MPI_COMM_WORLD);
  CHECK(BlockTimer::IsFinalized());
}
```
Collaborator


No BlockTimer scope is ever entered in the test body. The static BlockTimer::timer.mem_data is zero-initialised (from the Timer constructor at timer.hpp:97-101). All reduced_* vectors are filled with zeros. Checks:

- `NodeMemoryMin()[i] >= 0.0` → `0 >= 0` ✓
- `NodeMemoryMax()[i] >= NodeMemoryMin()[i]` → `0 >= 0` ✓
- `NodeMemorySum()[i] >= NodeMemoryMax()[i]` → `0 >= 0` ✓

A hypothetical Finalize that simply calls resize(NUM_TIMINGS) on the reduced vectors and returns also passes. A hypothetical BlockTimer constructor/destructor that never called MarkMemory at all also passes.

One possible suggestion:

```cpp
TEST_CASE("BlockTimer scopes attribute memory to the correct phase",
          "[memoryreporting][Serial]")
{
  {
    BlockTimer bt(Timer::CONSTRUCT);
    std::vector<char> buf(8 * 1024 * 1024, 1);
    asm volatile("" :: "g"(buf.data()) : "memory");
  }
  BlockTimer::Finalize(MPI_COMM_WORLD);
  CHECK(BlockTimer::RankMemoryMax()[Timer::CONSTRUCT] > 0);
  CHECK(BlockTimer::RankMemoryMax()[Timer::KSP] == 0);
}
```

Member Author


I forgot, but I've already run into problems writing unit tests for my other memory reporting feature. The test measures peak memory, but if it runs in a process where the peak was already set by some earlier test, the allocation might not push the high-water mark. We also don't know the order in which tests are run.

I think the simplest solution is to remove the test.

Comment thread palace/utils/timer.hpp Outdated
Comment on lines +391 to +399
```cpp
std::string title =
    (num_nodes == 1) ? "Peak Memory Growth per Rank" : "Peak Memory Growth per Node";
PrintTable(comm, title, "Min.", "Max.", "Tot.",
           [&](int i) -> std::tuple<std::string, std::string, std::string>
           {
             return {memory_reporting::FormatBytes(m_min[i]),
                     memory_reporting::FormatBytes(m_max[i]),
                     memory_reporting::FormatBytes(m_sum[i])};
           });
```
Collaborator


Thinking about this, are min/max/sum the most important here? We're looking at the high-water mark, so the min of a max isn't that informative. The important ones are per-node and total memory. Then I think we could have something interesting if you tracked both the peak and the retained memory, something a bit like:

```
                               Δ Peak RSS    Δ Current RSS
=========================================================
Mesh Preprocessing              +40.9M        +40.9M       (kept)
Operator Construction           +33.0M        +28.0M       (mostly kept)
Coarse Solve (workspace)       +170.0M         -2.0M       (temporary)
Postprocessing                   +0.0K         +0.5M       (page-touching)
```

That would tell us if a section allocates and deallocates a big block of memory (maybe we should be reusing that allocation elsewhere?). Not sure if this is possible with the polling approach you have, but if it is, that could be pretty interesting to see.

Member Author


Thinking about this, are min/max/sum the most important here?

I added them as a proxy for load balance. Min/max essentially give me information about the root process. Sum here is the total for each phase; that's the most important one and the one I look at the most.

I agree that they are not as useful in most cases, so I can remove them from the table and keep them on the json only.

I thought about using the current RSS as well, but it becomes a little harder to use/interpret because we measure current RSS at a specific point in time (when the `Timer` is destructed). So you could allocate a big block of memory and deallocate it within the same scope, before we measure current RSS, and that would be invisible to the current-RSS timer.

Collaborator


Ah yes, I see. It's almost like we'd want a timer to poll continuously while it exists, until its destruction, but that's getting into something more intrusive.

Comment thread palace/utils/timer.hpp Outdated
Comment on lines +186 to +189
```cpp
// Stored reduction results (populated by Finalize).
inline static std::vector<double> reduced_time_min;
inline static std::vector<double> reduced_time_max;
inline static std::vector<double> reduced_time_avg;
```
Collaborator


Possibly we can organize these for a little less copy-paste, but the LOC gain is minimal (unless you want to add more pieces to track):

```cpp
private:
  struct ReducedSeries { std::vector<double> min, max, sum; };
  struct ReducedAvg    : ReducedSeries { std::vector<double> avg; };

  inline static ReducedAvg    reduced_time;
  inline static ReducedSeries reduced_rank_mem;
  inline static ReducedSeries reduced_node_mem;
  inline static int           num_nodes = 0;

public:
  struct Reductions {
    const ReducedAvg&    time;
    const ReducedSeries& rank_mem;
    const ReducedSeries& node_mem;
    int                  num_nodes;
  };
  static Reductions GetReductions();  // MFEM_VERIFY(IsFinalized()) once, here
```
`SaveMetadata`'s six separate accessor calls at `basesolver.cpp:305-310` collapse to one `const auto r = BlockTimer::GetReductions();`, the Finalize-before-access contract lives at a single chokepoint, and the `Timer&` cleanup above can ride in the same diff.

Sbozzolo added a commit that referenced this pull request Apr 22, 2026
@Sbozzolo Sbozzolo force-pushed the gbozzola/more_memory_reporting branch from 705eb1d to 1291d89 Compare April 22, 2026 17:21
@Sbozzolo Sbozzolo force-pushed the gbozzola/more_memory_reporting branch from 1291d89 to 01da1b4 Compare April 22, 2026 17:31
@Sbozzolo Sbozzolo requested a review from hughcars April 22, 2026 18:17
Collaborator

@hughcars hughcars left a comment


I think we should have a follow-up to add the within-phase deltas, but no need to hold this up.

Comment thread docs/src/developer/notes.md Outdated
Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities:
per-process snapshots and per-phase growth.
Memory reporting in *Palace* tracks **peak** RSS (Resident Set Size) at two granularities:
per-process snapshots and per-phase growth. Goal of this reporting is to understand the
Collaborator


Nit: The goal

@Sbozzolo Sbozzolo enabled auto-merge (squash) April 22, 2026 19:54
@Sbozzolo Sbozzolo merged commit f05f062 into main Apr 22, 2026
72 of 74 checks passed
@Sbozzolo Sbozzolo deleted the gbozzola/more_memory_reporting branch April 22, 2026 21:25