out-of-memory events #97

Closed
airmler opened this issue Aug 3, 2021 · 5 comments

@airmler

airmler commented Aug 3, 2021

Hi,
I have OOM events when working with quite large matrix dimensions.

I am working with commit a7c6bb3 on Intel Xeon Platinum 8360Y nodes (72 cores per node and 256 GB RAM), using intel-mpi and intel-mkl (not too old versions).

When running on 2 of these 72-core machines and executing the following command:
srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5
(which is identical to mpirun -np 144 $EXE when using another launcher),
I get an OOM event (Slurm kills the job because some process runs out of memory).

However, the miniapp reports in stdout:
Required memory per rank (in #elements): 171415254
which at 8 bytes per element is around 1.3 GB. This is far less than the 3.7 GB per rank I should have available.

Reducing the memory COSMA is allowed to use helps:
export COSMA_CPU_MAX_MEMORY=1100
runs without problems.

On this cluster, it is not so easy to profile the memory consumption. Are you aware of any load imbalances? Is the reported "required memory per rank" reliable?

===
Note: it is not important to me that this issue gets resolved; I simply want to share this behavior with you.
I am sure you can scale the matrix dimensions so that you can reproduce the problem on your cluster.
Otherwise, I could try to run some suggested examples on "my" machine.

Best regards.

@kabicm
Collaborator

kabicm commented Sep 28, 2021

Hi @airmler!

I discussed a similar OOM problem yesterday with @rohany.

Let me first ask a short question: how many ranks per node are you running? If you have 3.7 GB available per rank, does that mean you have around 256/3.7 ≈ 70 ranks per node?

The required memory estimate in COSMA shows how much memory has to be present throughout the whole execution of the algorithm. During the execution, however, some additional temporary buffers are allocated and deallocated a few times, and this memory is not included in the estimate. These temporary buffers should not have a significant impact, as they are not dominant. What is possible is that MPI itself also requires memory for its own temporary buffers under the hood, and this might have a significant impact.

If you want to fine-tune COSMA for memory-limited cases, you can try adding -s "sm2", -s "sm2,sn2" or -s "sm2,sn2,sk2" as a command-line option to cosma_miniapp. The -s flag specifies the "splitting" strategy: each triplet, say "sm2", means that a sequential step ("s") will be performed to split the dimension m by 2, and COSMA will then be run on each half. This reduces the amount of required memory, but also affects performance.
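For instance, using the matrix sizes from your report and picking the "sm2,sn2" variant as just one example of the options above, the invocation would look something like:

srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5 -s "sm2,sn2"

i.e. m is split sequentially by 2, then n is split sequentially by 2, reducing the memory needed for each piece.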

When you write export COSMA_CPU_MAX_MEMORY=1100, COSMA will try to add some sequential steps (as described above) that should reduce the amount of memory used. However, this might reduce the memory too much and thus have a large impact on the performance. For this reason, manually specifying how to split the dimensions, as described in the previous paragraph, might yield better performance results.
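Putting this together with the run from your report, the memory-limited variant would simply be:

export COSMA_CPU_MAX_MEMORY=1100
srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5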

Let me know if this works for you!

Cheers,
Marko

@kabicm
Collaborator

kabicm commented Oct 4, 2021

Hi @airmler, I will close this issue now, but feel free to reopen it anytime in case you still have some questions!

kabicm closed this as completed on Oct 4, 2021
@airmler
Author

airmler commented Oct 4, 2021

Thanks for your explanation.
I was using 72 ranks per node, i.e. about 3.55 GB per rank.
One thing still puzzles me: if this really is an OOM event, your estimate would be off by more than 2x (about 2 GB). It is hard to believe that this is due to temporary MPI buffers.
Thanks for your time.

@kabicm
Collaborator

kabicm commented Oct 4, 2021

Indeed, but there is one more thing I forgot to mention: COSMA allocates one large memory pool (which is really just a std::vector under the hood) and all the buffers are taken from this pool. The estimate is actually the total size of this memory pool. However, this also means that COSMA requires all of this memory to be allocated as a single, consecutive piece of memory. So even if there seems to be enough memory at first sight, it doesn't help if that memory is fragmented.
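To illustrate the design, here is a minimal sketch (a simplified example, not COSMA's actual code; the class name and interface are made up for illustration) of such a pool built on top of a single std::vector<double>:

#include <cstddef>
#include <vector>

// Simplified illustration: every buffer is carved out of one contiguous
// allocation, so the whole pool must fit into a single consecutive piece
// of memory; free-but-fragmented memory elsewhere does not help.
class MemoryPool {
public:
    // total_elements plays the role of the reported "required memory per rank"
    explicit MemoryPool(std::size_t total_elements) : data_(total_elements) {}

    // Hand out a buffer of n elements as a pointer into the pool.
    double* get_buffer(std::size_t n) {
        double* ptr = data_.data() + offset_;
        offset_ += n;
        return ptr;
    }

private:
    std::vector<double> data_;  // one large, contiguous allocation
    std::size_t offset_ = 0;    // next free position in the pool
};

int main() {
    MemoryPool pool(171415254);                   // value printed by the miniapp above
    double* some_buffer = pool.get_buffer(1000);  // hypothetical buffer request
    (void)some_buffer;
}

The std::vector constructor is where the whole estimate has to be obtained as one consecutive block, which is why fragmentation can cause an allocation failure even when the total free memory looks sufficient.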

We are considering switching to a well-established memory pool implementation that would avoid this problem.

@kabicm
Collaborator

kabicm commented Oct 4, 2021

Let's discuss this further in a separate issue: #99.
