out-of-memory events #97
Hi @airmler! I discussed a similar OOM problem yesterday with @rohany. Let me first ask a short question: how many ranks per node are you running? If you have 3.7 GB available per rank, does that mean you have around 256/3.7 ≈ 70 ranks per node?

The required memory estimate in COSMA shows how much memory has to be present throughout the whole execution of the algorithm. During the execution, some additional temporary buffers are allocated and deallocated a few times, and this memory is not included in the estimate. However, these temporary buffers should not have a significant impact, as they are not dominant. What is possible is that MPI itself also requires some memory for its own temporary buffers under the hood, and this might have a significant impact.

If you want to fine-tune COSMA for memory-limited cases, you can try adding:

When you write

Let me know if this works for you! Cheers,
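The ranks-per-node estimate above is just arithmetic on the numbers already quoted in this thread (256 GB of node RAM, 3.7 GB available per rank); a quick check:

```shell
# Sanity-check the ranks-per-node estimate: 256 GB of node RAM
# divided by 3.7 GB available per rank gives ~69, which the
# comment above rounds to 70.
awk 'BEGIN { printf "ranks per node: %.1f\n", 256 / 3.7 }'
```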
Hi @airmler, I will close this issue now, but feel free to reopen it anytime in case you still have some questions!
Thanks for your explanation.
Indeed, but there is one more thing I forgot to mention: COSMA allocates one large memory pool up front. We are considering switching to a well-established memory pool implementation, which would avoid this problem.
Let's discuss this further in a separate issue: #99.
Hi,
I have OOM events when working with quite large matrix dimensions.
I am working with commit a7c6bb3
on Intel Xeon Platinum 8360Y nodes (72 cores per node, 256 GB RAM),
using the following libraries: intel-mpi and intel-mkl (not too old versions).
When working on 2 of my 72-core machines and executing the following command:
srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5
(which is identical to mpirun -np 144 $EXE when using a different launcher),
I get an OOM event (slurm kills the job because some process runs out of memory).
However, stdout reports:
Required memory per rank (in #elements): 171415254
which corresponds to around 1.3 GB (assuming 8-byte elements). This is way less than the 3.7 GB per rank that should be available.
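For reference, here is how the reported element count converts to the ~1.3 GB figure, assuming 8-byte double-precision elements:

```shell
# 171415254 double-precision elements at 8 bytes each,
# converted to GiB (1024^3 bytes).
awk 'BEGIN { printf "%.2f GB per rank\n", 171415254 * 8 / 1024^3 }'
```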
What helps is reducing the memory COSMA is allowed to use:
export COSMA_CPU_MAX_MEMORY=1100
With this setting, the job runs without problems.
On this cluster, it is not so easy to profile the memory consumption. Are you aware of any load imbalances? Is the reported "required memory per rank" reliable?
===
Note: it is not important to me that this issue gets resolved. I simply want to share this behavior with you.
I am sure you can scale the matrix dimensions so that you can reproduce the problem on your cluster.
Otherwise, I could run some suggested examples on "my" machine.
Best regards.