Conversation
With the tools I added in PR #708, I am now in a position to look at optimizing how memory is used in Palace. One of the first opportunities I found is the allocation of `H`. Currently, we allocate the full Hessenberg matrix of size `max_dim * max_dim`, even when far fewer iterations are needed. For cases where `max_dim` (= `MaxIts`) is large, this can be very significant. Here, I allocate `H` incrementally as the GMRES iteration proceeds. In my test case, incremental allocation reduced the total memory from 600 GB to 200 GB.
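To illustrate the idea, here is a minimal sketch (not the actual Palace code; the helper name is hypothetical): column `j` of an upper Hessenberg matrix has `j + 2` entries, so the storage can grow with the iterations actually performed.

```cpp
#include <vector>

// Sketch: append one Hessenberg column per GMRES/Arnoldi iteration instead
// of reserving the full max_dim x max_dim block up front.
std::vector<std::vector<double>> H;

void AppendHessenbergColumn(int j)
{
  H.emplace_back(j + 2);  // column j holds entries H(0 .. j + 1, j)
  // ... Arnoldi orthogonalization fills H[j][0 .. j + 1] ...
}
```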
hughcars
left a comment
Useful feature. Some suggestions, mostly minor things, but one is possibly a bit bigger. Given this is for collecting data for decisions, I was trying to think of extra data that might be useful later, so we don't have to change this new table down the line (and your historical samples are richer going forward).
| The `BlockTimer` constructor accepts a `count` parameter (default `true`). When `count` is
| `false`, both timing and memory tracking are disabled for that scope. This is used in tight
| loops (e.g., preconditioner application, coarse solve within iterative solvers) to avoid the
| overhead of `getrusage()` system calls.
If we're really seeing an effect of BlockTimer construction and destruction, that suggests the timer should be hoisted out of whatever loop is in play.
I misunderstood what `count` was for. I thought it was for performance, but it is to separate accounting, as I understand from #147.
I will update the docs.
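For illustration, a minimal sketch of how the `count` flag separates accounting in a tight loop (the phase tags and the call site are assumptions, not the actual Palace code):

```cpp
// The outer solver phase is timed and memory-tracked; the inner per-iteration
// scope passes count = false, which per the quoted docs disables timing and
// memory tracking there, so the cost stays accounted to the enclosing phase.
void SolveWithAccounting(int max_its)
{
  BlockTimer bt(Timer::KSP);
  for (int it = 0; it < max_its; it++)
  {
    BlockTimer bt_pc(Timer::PRECONDITIONER, false);  // count = false
    ApplyPreconditioner();                           // hypothetical helper
  }
}
```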
| Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities:
| per-process snapshots and per-phase growth.
Maybe add a sentence here about how the goal of this reporting is to understand the maximum memory required by a job. I think reporting the high-water mark is very practically useful, but looking at the tables users might get confused and think that, e.g., error estimation doesn't use any memory. Additionally, as developers it doesn't tell us about all wasteful memory usage (which might also affect speed, for example), just the "fattest" section. A sentence to that effect here would be informative for someone skimming. I know it's covered below, but that's the most potentially surprising feature, so it pays to draw attention to it very briefly before the main section.
|
| ### Per-phase memory growth
|
| The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per |
| The `Timer`/`BlockTimer` system tracks not only elapsed time but also peak RSS growth per
| The `Timer`/`BlockTimer` system tracks not only elapsed time but also *peak* RSS growth per
Bold or italicize the "peak"; it's pretty key.
| Peak RSS is monotonically non-decreasing (the OS high-water mark), so per-phase deltas are
| always non-negative. A phase that allocates memory temporarily and then frees it will still
| show the growth if the allocation pushed the peak. A phase showing zero growth means it did
| not exceed the previously established peak.
Is there a metric for the number of allocation requests made? I.e., if the high-water mark never changes, but we request millions of new chunks only to deallocate them straight after, that is probably not great performance-wise. As above though, I think such a thing is likely an extension, not for this PR.
That would be useful, but I suspect it would require more aggressive instrumentation (either continuous polling or injecting code into malloc/new).
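For reference, a hypothetical sketch of the "injecting code into malloc/new" option, counting allocation requests by replacing the global `operator new`/`operator delete` (alignment and array overloads omitted for brevity). This is not part of the PR.

```cpp
#include <atomic>
#include <cstdlib>
#include <new>

std::atomic<std::size_t> alloc_requests{0};  // total operator new calls

void *operator new(std::size_t size)
{
  alloc_requests.fetch_add(1, std::memory_order_relaxed);
  if (void *p = std::malloc(size))
  {
    return p;
  }
  throw std::bad_alloc();
}

void operator delete(void *p) noexcept
{
  std::free(p);
}
```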
| TEST_CASE("BlockTimer Finalize Contract", "[memoryreporting][Serial][Parallel]") | ||
| { | ||
| // Finalize populates the stored reduction results. | ||
| BlockTimer::Finalize(MPI_COMM_WORLD); | ||
| CHECK(BlockTimer::IsFinalized()); | ||
|
|
||
| // Stored vectors have the correct size. | ||
| CHECK(BlockTimer::NodeMemoryMin().size() == Timer::NUM_TIMINGS); | ||
| CHECK(BlockTimer::NodeMemoryMax().size() == Timer::NUM_TIMINGS); | ||
| CHECK(BlockTimer::NodeMemorySum().size() == Timer::NUM_TIMINGS); | ||
|
|
||
| // Per-node values are non-negative (peak RSS deltas can't be negative). | ||
| for (int i = Timer::INIT; i < Timer::NUM_TIMINGS; i++) | ||
| { | ||
| CHECK(BlockTimer::NodeMemoryMin()[i] >= 0.0); | ||
| CHECK(BlockTimer::NodeMemoryMax()[i] >= BlockTimer::NodeMemoryMin()[i]); | ||
| CHECK(BlockTimer::NodeMemorySum()[i] >= BlockTimer::NodeMemoryMax()[i]); | ||
| } | ||
|
|
||
| // Calling Finalize again does not crash (idempotent for AMR loop usage). | ||
| BlockTimer::Finalize(MPI_COMM_WORLD); | ||
| CHECK(BlockTimer::IsFinalized()); | ||
| } |
No `BlockTimer` scope is ever entered in the test body. The static `BlockTimer::timer.mem_data` is zero-initialized (from the `Timer` constructor at timer.hpp:97-101), so all `reduced_*` vectors are filled with zeros. The checks become:

- `NodeMemoryMin()[i] >= 0.0` → `0 >= 0` ✓
- `NodeMemoryMax()[i] >= NodeMemoryMin()[i]` → `0 >= 0` ✓
- `NodeMemorySum()[i] >= NodeMemoryMax()[i]` → `0 >= 0` ✓

A hypothetical `Finalize` that simply calls `resize(NUM_TIMINGS)` on the reduced vectors and returns also passes. A hypothetical `BlockTimer` constructor/destructor that never called `MarkMemory` at all also passes.
One possible suggestion:
TEST_CASE("BlockTimer scopes attribute memory to the correct phase",
"[memoryreporting][Serial]")
{
{
BlockTimer bt(Timer::CONSTRUCT);
std::vector<char> buf(8 * 1024 * 1024, 1);
asm volatile("" :: "g"(buf.data()) : "memory");
}
BlockTimer::Finalize(MPI_COMM_WORLD);
CHECK(BlockTimer::RankMemoryMax()[Timer::CONSTRUCT] > 0);
CHECK(BlockTimer::RankMemoryMax()[Timer::KSP] == 0);
}There was a problem hiding this comment.
I forgot, but I've already run into problems writing unit tests for my other memory reporting feature. The test measures peak memory, but if it runs in a process whose peak was already set by some prior test, the allocation might not push the high-water mark. We also don't know the order in which tests are run.
I think the simplest solution is to remove the test.
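For reference, the quantity that makes the test order-dependent is the process-wide high-water mark from `getrusage()` (mentioned in the quoted docs); a minimal reader looks like this (note the Linux/macOS unit difference for `ru_maxrss`):

```cpp
#include <sys/resource.h>
#include <cstddef>

std::size_t PeakRssBytes()
{
  struct rusage ru;
  getrusage(RUSAGE_SELF, &ru);
#if defined(__APPLE__)
  return static_cast<std::size_t>(ru.ru_maxrss);         // bytes on macOS
#else
  return static_cast<std::size_t>(ru.ru_maxrss) * 1024;  // KiB on Linux
#endif
}
```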
| std::string title =
|     (num_nodes == 1) ? "Peak Memory Growth per Rank" : "Peak Memory Growth per Node";
| PrintTable(comm, title, "Min.", "Max.", "Tot.",
|            [&](int i) -> std::tuple<std::string, std::string, std::string>
|            {
|              return {memory_reporting::FormatBytes(m_min[i]),
|                      memory_reporting::FormatBytes(m_max[i]),
|                      memory_reporting::FormatBytes(m_sum[i])};
|            });
Thinking about this, are min/max/sum the most important here? We're looking at the high-water mark, so the min of a max isn't that informative. The important ones are per-node and total memory. Then I think we could have something interesting if you tracked the peak and the retained, something a bit like:

```
                           Δ Peak RSS   Δ Current RSS
=====================================================
Mesh Preprocessing             +40.9M   +40.9M (kept)
Operator Construction          +33.0M   +28.0M (mostly kept)
Coarse Solve (workspace)      +170.0M    -2.0M (temporary)
Postprocessing                  +0.0K    +0.5M (page-touching)
```

That would tell us if a section allocates and deallocates a big block of memory (maybe we should be reusing that allocation elsewhere?). Not sure if this is possible with the polling approach you have, but if it is, it could be pretty interesting to see.
> Thinking about this, are min/max/sum the most important here?

I added them as a proxy for load balance. Min/max essentially give me information about the root process. Sum here is the total per phase; that's the most important one and the one I look at the most. I agree that they are not as useful in most cases, so I can remove them from the table and keep them in the JSON only.

I thought about using the current RSS as well, but it becomes a little harder to use/interpret because we measure current RSS at a specific point in time (when the `Timer` is destructed). So you could allocate a big block of memory and deallocate it within the same scope, but before we measure current RSS, and that would be invisible to the current RSS measurement.
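A sketch of the blind spot described above (the phase tag and body are hypothetical):

```cpp
#include <vector>

void PhaseWithTemporary()
{
  BlockTimer bt(Timer::KSP);  // illustrative phase tag
  {
    std::vector<double> big(1 << 27, 1.0);  // ~1 GiB, pages touched: pushes peak RSS
  }  // freed here, before bt is destructed
  // A current-RSS sample taken at bt's destruction no longer sees `big`,
  // but the peak RSS delta for this phase still records it.
}
```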
Ah yes, I see. It's almost like we'd want a timer, whilst it is in existence, to poll continuously until its destruction, but that's getting into a more intrusive thing.
| // Stored reduction results (populated by Finalize).
| inline static std::vector<double> reduced_time_min;
| inline static std::vector<double> reduced_time_max;
| inline static std::vector<double> reduced_time_avg;
Possibly these can be organized for a little less copy-paste, though the LOC gain is minimal (unless you want to add more pieces to track):
```cpp
private:
  struct ReducedSeries { std::vector<double> min, max, sum; };
  struct ReducedAvg : ReducedSeries { std::vector<double> avg; };
  inline static ReducedAvg reduced_time;
  inline static ReducedSeries reduced_rank_mem;
  inline static ReducedSeries reduced_node_mem;
  inline static int num_nodes = 0;

public:
  struct Reductions
  {
    const ReducedAvg &time;
    const ReducedSeries &rank_mem;
    const ReducedSeries &node_mem;
    int num_nodes;
  };
  static Reductions GetReductions();  // MFEM_VERIFY(IsFinalized()) once, here
```
`SaveMetadata`'s six separate accessor calls at basesolver.cpp:305-310 collapse to one `const auto r = BlockTimer::GetReductions();`, the Finalize-before-access contract lives at one chokepoint, and the `Timer&` cleanup above can ride in the same diff.
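One possible definition under that suggestion (a sketch; the verify message is illustrative):

```cpp
BlockTimer::Reductions BlockTimer::GetReductions()
{
  // Single chokepoint for the Finalize-before-access contract.
  MFEM_VERIFY(IsFinalized(),
              "BlockTimer::Finalize() must be called before accessing reductions!");
  return {reduced_time, reduced_rank_mem, reduced_node_mem, num_nodes};
}
```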
I expanded the timer system to collect memory information. A new table is printed at the end of each run:

```
                  Peak Memory Growth per Rank
                            Min.        Max.        Tot.
==============================================================
Initialization              3.0M        9.0M        1.1G
Mesh Preprocessing          0.0K       40.9M      391.9M
Operator Construction      12.0M       33.0M        3.1G
Linear Solve                0.0K        4.5M      736.5M
Setup                       1.5M        7.5M     1021.5M
Preconditioner              0.0K        1.5M       31.5M
Coarse Solve               86.7M      170.2M       25.2G
Eigenvalue Solve            0.0K        3.0M      225.5M
Div.-Free Projection        6.0M       79.5M        6.0G
Estimation                  0.0K        0.0K        0.0K
Construction                0.0K        4.5M      312.0M
Solve                       0.0K        0.0K        0.0K
Postprocessing              0.0K        1.5M        4.5M
Paraview                    0.0K        0.0K        0.0K
Grid function               0.0K        0.0K        0.0K
Disk IO                     3.0M       48.0M      621.0M
--------------------------------------------------------------
Total                     244.8M      303.6M       49.4G
```

Each row of this table should be interpreted as "this phase pushed the peak memory by this amount". This information is also added to `palace.json`. I collect both per-rank and per-node information. The above table reports per-rank information when there is only one node; otherwise it reports per-node.
hughcars
left a comment
I think we should have a follow-up to add the within-phase deltas, but there is no need to hold this up.
| Memory reporting in *Palace* tracks peak RSS (Resident Set Size) at two granularities:
| per-process snapshots and per-phase growth.
| Memory reporting in *Palace* tracks **peak** RSS (Resident Set Size) at two granularities:
| per-process snapshots and per-phase growth. The goal of this reporting is to understand the
In this, I slightly refactored the `Timer` interface to split the calculations into a `Finalize` function (instead of inside `Print`).