
Excessive IO and RAM usage when reading big SDP blocks #164

Closed
vasdommes opened this issue Dec 23, 2023 · 2 comments
vasdommes commented Dec 23, 2023

If an SDP block XXX is shared among many processes, each process reads block_data_XXX.bin from sdp.zip, but most of the read data is not used.

For example, if two processes share a block, then both read the whole local Matrix B from block_data_XXX.bin.
After that, they create and fill a distributed matrix DistMatrix B: the first process copies columns 0,2,4... from Matrix B to DistMatrix B, and the second process copies columns 1,3,5...
As a result, each process never uses half of the Matrix B it has read.
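
To make the waste concrete, here is a minimal sketch of this pattern using Elemental's API (not SDPB's actual code; the function name is illustrative): each rank has already parsed the full local matrix, but copies only its owned entries into the distributed one.

```cpp
#include <El.hpp>

// Hypothetical sketch: every rank has parsed the full matrix `local_B`
// from block_data_XXX.bin, but copies only its own share into `dist_B`.
void fill_dist_matrix(const El::Matrix<El::BigFloat> &local_B,
                      El::DistMatrix<El::BigFloat> &dist_B)
{
  // Loop over the entries this rank owns in the distributed matrix;
  // for two ranks these are the even (or odd) columns.
  for(El::Int jLoc = 0; jLoc < dist_B.LocalWidth(); ++jLoc)
    for(El::Int iLoc = 0; iLoc < dist_B.LocalHeight(); ++iLoc)
      {
        const El::Int i = dist_B.GlobalRow(iLoc);
        const El::Int j = dist_B.GlobalCol(jLoc);
        dist_B.SetLocal(iLoc, jLoc, local_B.Get(i, j));
        // Every entry of local_B owned by another rank was read in vain.
      }
}
```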

The situation gets worse as a block is shared among more processes.
For example, I observed ~85GB RAM usage for reading an ~8GB sdp.zip when some big blocks were assigned to 18 processes each.
(The same sdp.zip required only ~60GB when these blocks were shared among 10 processes.)

Possible solutions:

  1. Decrease RAM usage: serialize matrices, e.g., column by column. Then each process can immediately discard a column if it belongs to another process.
  2. Decrease IO: the first process in each group reads a block and sends the data to the other processes, or writes it to a shared memory window (see the sketch below).
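
A hedged sketch of the second idea, assuming a per-group communicator `group_comm` (all names here are illustrative, and a broadcast stands in for whatever send/recv or shared-memory scheme is actually used): only rank 0 of the group touches the file, so the filesystem sees one reader per group instead of one per rank.

```cpp
#include <mpi.h>
#include <fstream>
#include <string>
#include <vector>

// Sketch: only the group root reads the block file; the raw bytes are
// then shipped to the rest of the group over MPI.
std::vector<char> read_block_on_root(MPI_Comm group_comm,
                                     const std::string &block_path)
{
  int rank;
  MPI_Comm_rank(group_comm, &rank);

  std::vector<char> bytes;
  unsigned long long size = 0;
  if(rank == 0)
    {
      std::ifstream file(block_path, std::ios::binary | std::ios::ate);
      size = file.tellg();
      bytes.resize(size);
      file.seekg(0);
      file.read(bytes.data(), size);
    }
  // Tell everyone the payload size, then send the payload itself.
  MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG_LONG, 0, group_comm);
  if(rank != 0)
    bytes.resize(size);
  MPI_Bcast(bytes.data(), static_cast<int>(size), MPI_CHAR, 0, group_comm);
  return bytes;
}
```

The commits referenced below implement this first with point-to-point MPI send/recv and later with a shared memory window.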
@vasdommes vasdommes added this to the Backlog milestone Dec 23, 2023
@vasdommes vasdommes self-assigned this Dec 23, 2023
vasdommes added a commit that referenced this issue Dec 27, 2023
Now only the first rank from each group reads block_data and sends it to other ranks via MPI send/recv,
see read_block_stream() + copy_matrix_from_root()

TODO: refactor, so that other ranks do not open filestreams etc.
TODO: measure performance and RAM, compare with old version
TODO: use shared memory window instead of MPI send/recv, compare performance
vasdommes (Collaborator, Author) commented:

Benchmarks

Data

Some tests for the GNY model, nmax=14, on Expanse HPC: nodes=4, procsPerNode=128, procGranularity=1.
Data for the first (big) block is shared among 8 processes.
Timings are shown for rank=0; RAM usage for node=0.

  1. SDPB v2.6.1, each process reads block_data and copies it to distributed matrices in SDP:
     read_sdp: 17.8s
     RAM increase: 29.2 GB
  2. Only the first process in each group reads block_data and sends it to other ranks via MPI_Send/MPI_Recv:
     read_sdp: 6.2s
       parse block_data: 3.6s
       synchronize: 2.3s
     RAM increase: 15.1 GB
  3. Only the first process in each group reads block_data and sends it to other ranks via an MPI shared memory window:
     read_sdp: 4.7s
       parse block_data: 3.7s
       synchronize: 0.8s
     RAM increase: 15.1 GB

Summary

  1. When only the first process reads from disk, we don't keep extra copies of each block and need ~2x less RAM for reading.
  2. When only the first process reads from disk, we need extra time for synchronization. Despite that, we still get an overall ~3x speedup, because the disk IO load is reduced and IO is several times faster.
  3. Synchronizing data through a shared memory window is ~3x faster than via MPI_Send/MPI_Recv. Thus, we should use the shared memory window (see the sketch below).
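
For reference, a minimal sketch of the shared-memory-window variant, assuming `group_comm` contains only ranks on one node (e.g., from MPI_Comm_split_type with MPI_COMM_TYPE_SHARED; names are illustrative): the root writes serialized data into the window, and the other ranks deserialize straight out of it with no per-rank copy of the full block.

```cpp
#include <mpi.h>

// Sketch: allocate one node-local shared segment owned by rank 0 of
// group_comm and give every rank a direct pointer into it.
char *allocate_shared_block(MPI_Comm group_comm, MPI_Aint num_bytes,
                            MPI_Win &window)
{
  int rank;
  MPI_Comm_rank(group_comm, &rank);

  // Only the root contributes memory; the others attach with size 0.
  void *base = nullptr;
  MPI_Win_allocate_shared(rank == 0 ? num_bytes : 0, /*disp_unit=*/1,
                          MPI_INFO_NULL, group_comm, &base, &window);

  // All ranks query the address of the root's segment in their own
  // address space; reads from it are plain loads, no MPI messages.
  MPI_Aint segment_size;
  int disp_unit;
  MPI_Win_shared_query(window, 0, &segment_size, &disp_unit, &base);
  return static_cast<char *>(base);
}
// After the root fills the segment, a barrier or MPI_Win_fence is
// needed before the other ranks start deserializing from it.
```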

Note that IO times can be unstable: for example, for another problem I got IO times of ~30s for several runs (of each algorithm), and then one run with ~3s.

Anyway, SDP reading time is small compared to a solver iteration (~190s for the same case), so even if reading becomes a bit slower due to synchronization, this is not a problem. Fixing RAM issues for large problems is more important.

vasdommes (Collaborator, Author) commented:

Simple time estimates:

  1. send/recv:
     MAX( #(M) * (serialize + send) , #(M) / (N - 1) * (recv + deserialize) )
  2. shared window:
     #(M) * serialize + #(M) / (N - 1) * (read + deserialize)

where #(M) is the matrix size and N is the number of processes in a group.

One can conclude that send is expensive compared to read + deserialize. In my tests, the shared window implementation is more than 1.5x faster even for N=2.
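
For example, substituting N = 2 into the estimates above (one reader plus one receiver; this substitution is my own illustration, not from the benchmarks):

  send/recv: MAX( #(M) * (serialize + send) , #(M) * (recv + deserialize) )
  shared window: #(M) * (serialize + read + deserialize)

A ≥1.5x win for the shared window already at N = 2 is consistent with the send term dominating the other costs.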

vasdommes added a commit that referenced this issue Dec 28, 2023
Fix #164 Excessive IO and RAM usage when reading big SDP blocks
bharathr98 pushed a commit to bharathr98/sdpb that referenced this issue Mar 1, 2024
Now only the first rank from each group reads block_data and sends it to other ranks via MPI shared memory window,
see read_block_data() + copy_matrix_from_root()

+ File structure and refactoring for SDP/read_block_data
+ Added timers to SDP constructor and read_block_data()