Excessive IO and RAM usage when reading big SDP blocks #164
Now only the first rank from each group reads `block_data` and sends it to the other ranks via MPI send/recv, see `read_block_stream()` + `copy_matrix_from_root()`.

TODO:
- refactor so that the other ranks do not open filestreams etc.
- measure performance and RAM, compare with the old version
- use a shared memory window instead of MPI send/recv, compare performance
Benchmarks

Data

Some tests for the GNY model, nmax=14, on Expanse HPC: nodes=4, procsPerNode=128, procGranularity=1.
Summary
Note that IO times can be unstable: for example, for another problem I got IO times of ~30s over several runs (for each algorithm), and then one run with ~3s. In any case, SDP reading time is small compared to a solver iteration (~190s for the same case), so even if reading becomes somewhat slower due to synchronization, this is not a problem. Fixing RAM issues for large problems is more important.
Simple time estimates:
Fix #164 Excessive IO and RAM usage when reading big SDP blocks
Now only the first rank from each group reads `block_data` and shares it with the other ranks via an MPI shared memory window, see `read_block_data()` + `copy_matrix_from_root()`.

- File structure and refactoring for `SDP`/`read_block_data`
- Added timers to the `SDP` constructor and `read_block_data()`
If some SDP block XXX is shared among many processes, each process reads `block_data_XXX.bin` from `sdp.zip`, but most of the read data is not used.

For example, if two processes share a block, then both read the whole local `Matrix B` from `block_data_XXX.bin`. After that, they create and fill a distributed matrix `DistMatrix B`: the first process copies columns 0,2,4,... from `Matrix B` to `DistMatrix B`, and the second process copies columns 1,3,5,... As a result, each process never uses half of its `Matrix B`.

The situation is worse when a block is shared among a larger number of processes. For example, I observed ~85GB RAM usage when reading an ~8GB sdp.zip, when some big blocks were assigned to 18 processes each. (The same sdp.zip required only ~60GB when these blocks were shared among 10 processes.)
Possible solutions: