out-of-memory events #97

Closed
airmler opened this issue Aug 3, 2021 · 5 comments

@airmler

airmler commented Aug 3, 2021

Hi,
I have OOM events when working with quite large matrix dimensions.

I am working with commit a7c6bb3 on Intel Xeon Platinum 8360Y nodes (72 cores per node and 256 GB RAM), using intel-mpi and intel-mkl (not too old versions).

When running on 2 of these 72-core machines and executing the following command:
srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5
(which is identical to mpirun -np 144 $EXE when using another launcher),
I get an OOM event (Slurm kills the job because some process runs out of memory).

However, the miniapp reports in stdout:
Required memory per rank (in #elements): 171415254
which at 8 bytes per element is around 1.3 GB. This is far less than the 3.7 GB per rank I should have available.

Reducing the memory COSMA is allowed to use helps:
export COSMA_CPU_MAX_MEMORY=1100
runs without problems.

On this cluster, it is not so easy to profile the memory consumption. Are you aware of any load imbalances? Is the reported "required memory per rank" reliable?

===
Note: it is not important to me that this issue gets resolved; I simply want to share this behavior with you.
I am sure you can scale the matrix dimensions so that you can reproduce the problem on your cluster.
Otherwise, I could try to run some suggested examples on "my" machine.

Best regards.

@kabicm
Collaborator

kabicm commented Sep 28, 2021

Hi @airmler!

I discussed a similar OOM problem yesterday with @rohany.

Let me first ask a short question: how many ranks per node are you running? If you have 3.7 GB available per rank, does that mean you have around 256/3.7 ≈ 70 ranks per node?

The required memory estimate in COSMA shows how much memory has to be present throughout the whole execution of the algorithm. During the execution, however, some additional temporary buffers are allocated and deallocated a few times, and this memory is not included in the estimate. These temporary buffers should not have a significant impact, as they are not dominant. What is possible is that MPI itself also requires memory for its own temporary buffers under the hood, and this might have a significant impact.

If you want to fine-tune COSMA for memory-limited cases, you can try adding -s "sm2", -s "sm2,sn2" or -s "sm2,sn2,sk2" as a command-line option to cosma_miniapp. The -s flag specifies the "splitting" strategy: each triplet, say "sm2", means that a sequential step ("s") will be performed to split the dimension m by 2, and COSMA will then be run on each half. This reduces the amount of required memory, but also affects performance.
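For instance, using the matrix sizes from your report and picking the "sm2,sn2" variant as just one example of the options above, the invocation would look something like:

srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5 -s "sm2,sn2"

i.e. m is split sequentially by 2, then n is split sequentially by 2, reducing the memory needed for each piece.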

When you write export COSMA_CPU_MAX_MEMORY=1100, COSMA will try to add some sequential steps (as described above) that should reduce the amount of memory used. However, this might reduce the memory too much and thus have a large impact on the performance. For this reason, manually specifying how to split the dimensions, as described in the previous paragraph, might yield better performance results.
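Putting this together with the run from your report, the memory-limited variant would simply be:

export COSMA_CPU_MAX_MEMORY=1100
srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5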

Let me know if this works for you!

Cheers,
Marko

@kabicm
Collaborator

kabicm commented Oct 4, 2021

Hi @airmler, I will close this issue now, but feel free to reopen it anytime in case you still have some questions!

kabicm closed this as completed on Oct 4, 2021
@airmler
Author

airmler commented Oct 4, 2021

Thanks for your explanation.
I was using 72 ranks per node, i.e. about 3.55 GB per rank.
One thing still puzzles me: if this really is an OOM event, your estimate would be off by more than 2x (about 2 GB). It is hard to believe that this is due to temporary MPI buffers.
Thanks for your time.

@kabicm
Collaborator

kabicm commented Oct 4, 2021

Indeed, but there is one more thing I forgot to mention: COSMA allocates one large memory pool (which is really just a std::vector under the hood) and all the buffers are taken from this pool. The estimate is actually the total size of this memory pool. However, this also means that COSMA requires all of this memory to be allocated as a single, consecutive piece of memory. So even if there seems to be enough memory at first sight, it doesn't help if that memory is fragmented.
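To illustrate the design, here is a minimal sketch (a simplified example, not COSMA's actual code; the class name and interface are made up for illustration) of such a pool built on top of a single std::vector<double>:

#include <cstddef>
#include <vector>

// Simplified illustration: every buffer is carved out of one contiguous
// allocation, so the whole pool must fit into a single consecutive piece
// of memory; free-but-fragmented memory elsewhere does not help.
class MemoryPool {
public:
    // total_elements plays the role of the reported "required memory per rank"
    explicit MemoryPool(std::size_t total_elements) : data_(total_elements) {}

    // Hand out a buffer of n elements as a pointer into the pool.
    double* get_buffer(std::size_t n) {
        double* ptr = data_.data() + offset_;
        offset_ += n;
        return ptr;
    }

private:
    std::vector<double> data_;  // one large, contiguous allocation
    std::size_t offset_ = 0;    // next free position in the pool
};

int main() {
    MemoryPool pool(171415254);                   // value printed by the miniapp above
    double* some_buffer = pool.get_buffer(1000);  // hypothetical buffer request
    (void)some_buffer;
}

The std::vector constructor is where the whole estimate has to be obtained as one consecutive block, which is why fragmentation can cause an allocation failure even when the total free memory looks sufficient.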

We are considering switching to a well-established memory pool implementation that would avoid this problem.

@kabicm
Collaborator

kabicm commented Oct 4, 2021

Let's discuss this further in a separate issue: #99.
