
Excessive IO and RAM usage when reading big SDP blocks #164

Closed
vasdommes opened this issue Dec 23, 2023 · 2 comments
vasdommes commented Dec 23, 2023

If an SDP block XXX is shared among many processes, each process reads block_data_XXX.bin from sdp.zip, but most of the read data is not used.

For example, if two processes share a block, then both read the whole local Matrix B from block_data_XXX.bin.
After that, they create and fill a distributed matrix DistMatrix B: the first process copies columns 0,2,4... from Matrix B to DistMatrix B, and the second process copies columns 1,3,5...
As a result, each process never uses half of the Matrix B it has read.
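
To make the waste concrete, here is a minimal sketch of this pattern using Elemental's API (not SDPB's actual code; the function name is illustrative): each rank has already parsed the full local matrix, but copies only its owned entries into the distributed one.

```cpp
#include <El.hpp>

// Hypothetical sketch: every rank has parsed the full matrix `local_B`
// from block_data_XXX.bin, but copies only its own share into `dist_B`.
void fill_dist_matrix(const El::Matrix<El::BigFloat> &local_B,
                      El::DistMatrix<El::BigFloat> &dist_B)
{
  // Loop over the entries this rank owns in the distributed matrix;
  // for two ranks these are the even (or odd) columns.
  for(El::Int jLoc = 0; jLoc < dist_B.LocalWidth(); ++jLoc)
    for(El::Int iLoc = 0; iLoc < dist_B.LocalHeight(); ++iLoc)
      {
        const El::Int i = dist_B.GlobalRow(iLoc);
        const El::Int j = dist_B.GlobalCol(jLoc);
        dist_B.SetLocal(iLoc, jLoc, local_B.Get(i, j));
        // Every entry of local_B owned by another rank was read in vain.
      }
}
```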

The situation gets worse as a block is shared among more processes.
For example, I observed ~85GB RAM usage for reading an ~8GB sdp.zip when some big blocks were assigned to 18 processes each.
(The same sdp.zip required only ~60GB when these blocks were shared among 10 processes.)

Possible solutions:

  1. Decrease RAM usage: serialize matrices, e.g., column by column. Then each process can immediately discard a column if it belongs to another process.
  2. Decrease IO: the first process in each group reads a block and sends the data to the other processes, or writes it to a shared memory window (see the sketch below).
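
A hedged sketch of the second idea, assuming a per-group communicator `group_comm` (all names here are illustrative, and a broadcast stands in for whatever send/recv or shared-memory scheme is actually used): only rank 0 of the group touches the file, so the filesystem sees one reader per group instead of one per rank.

```cpp
#include <mpi.h>
#include <fstream>
#include <string>
#include <vector>

// Sketch: only the group root reads the block file; the raw bytes are
// then shipped to the rest of the group over MPI.
std::vector<char> read_block_on_root(MPI_Comm group_comm,
                                     const std::string &block_path)
{
  int rank;
  MPI_Comm_rank(group_comm, &rank);

  std::vector<char> bytes;
  unsigned long long size = 0;
  if(rank == 0)
    {
      std::ifstream file(block_path, std::ios::binary | std::ios::ate);
      size = file.tellg();
      bytes.resize(size);
      file.seekg(0);
      file.read(bytes.data(), size);
    }
  // Tell everyone the payload size, then send the payload itself.
  MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG_LONG, 0, group_comm);
  if(rank != 0)
    bytes.resize(size);
  MPI_Bcast(bytes.data(), static_cast<int>(size), MPI_CHAR, 0, group_comm);
  return bytes;
}
```

The commits referenced below implement this first with point-to-point MPI send/recv and later with a shared memory window.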
@vasdommes vasdommes added this to the Backlog milestone Dec 23, 2023
@vasdommes vasdommes self-assigned this Dec 23, 2023
vasdommes added a commit that referenced this issue Dec 27, 2023
Now only the first rank from each group reads block_data and sends it to other ranks via MPI send/recv,
see read_block_stream() + copy_matrix_from_root()

TODO: refactor, so that other ranks do not open filestreams etc.
TODO: measure performance and RAM, compare with old version
TODO: use shared memory window instead of MPI send/recv, compare performance
vasdommes (Collaborator, Author) commented:

Benchmarks

Data

Some tests for the GNY model, nmax=14, on Expanse HPC: nodes=4, procsPerNode=128, procGranularity=1.
Data for the first (big) block is shared among 8 processes.
Timings are shown for rank=0; RAM usage for node=0.

  1. SDPB v2.6.1, each process reads block_data and copies it to distributed matrices in SDP:
     read_sdp: 17.8s
     RAM increase: 29.2 GB
  2. Only the first process in each group reads block_data and sends it to other ranks via MPI_Send/MPI_Recv:
     read_sdp: 6.2s
       parse block_data: 3.6s
       synchronize: 2.3s
     RAM increase: 15.1 GB
  3. Only the first process in each group reads block_data and sends it to other ranks via an MPI shared memory window:
     read_sdp: 4.7s
       parse block_data: 3.7s
       synchronize: 0.8s
     RAM increase: 15.1 GB

Summary

  1. When only the first process reads from disk, we don't keep extra copies of each block and need ~2x less RAM for reading.
  2. When only the first process reads from disk, we need extra time for synchronization. Despite that, we still get an overall ~3x speedup, because the disk IO load is reduced and IO is several times faster.
  3. Synchronizing data through a shared memory window is ~3x faster than via MPI_Send/MPI_Recv. Thus, we should use the shared memory window (see the sketch below).
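
For reference, a minimal sketch of the shared-memory-window variant, assuming `group_comm` contains only ranks on one node (e.g., from MPI_Comm_split_type with MPI_COMM_TYPE_SHARED; names are illustrative): the root writes serialized data into the window, and the other ranks deserialize straight out of it with no per-rank copy of the full block.

```cpp
#include <mpi.h>

// Sketch: allocate one node-local shared segment owned by rank 0 of
// group_comm and give every rank a direct pointer into it.
char *allocate_shared_block(MPI_Comm group_comm, MPI_Aint num_bytes,
                            MPI_Win &window)
{
  int rank;
  MPI_Comm_rank(group_comm, &rank);

  // Only the root contributes memory; the others attach with size 0.
  void *base = nullptr;
  MPI_Win_allocate_shared(rank == 0 ? num_bytes : 0, /*disp_unit=*/1,
                          MPI_INFO_NULL, group_comm, &base, &window);

  // All ranks query the address of the root's segment in their own
  // address space; reads from it are plain loads, no MPI messages.
  MPI_Aint segment_size;
  int disp_unit;
  MPI_Win_shared_query(window, 0, &segment_size, &disp_unit, &base);
  return static_cast<char *>(base);
}
// After the root fills the segment, a barrier or MPI_Win_fence is
// needed before the other ranks start deserializing from it.
```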

Note that IO times can be unstable: for example, for another problem I got IO times of ~30s for several runs (of each algorithm), and then one run with ~3s.

Anyway, SDP reading time is small compared to a solver iteration (~190s for the same case), so even if reading becomes a bit slower due to synchronization, this is not a problem. Fixing RAM issues for large problems is more important.

vasdommes (Collaborator, Author) commented:

Simple time estimates:

  1. send/recv:
     MAX( #(M) * (serialize + send) , #(M) / (N - 1) * (recv + deserialize) )
  2. shared window:
     #(M) * serialize + #(M) / (N - 1) * (read + deserialize)

where #(M) is the matrix size and N is the number of processes in a group.

One can conclude that send is expensive compared to read + deserialize. In my tests, the shared window implementation is more than 1.5x faster even for N=2.
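
For example, substituting N = 2 into the estimates above (one reader plus one receiver; this substitution is my own illustration, not from the benchmarks):

  send/recv: MAX( #(M) * (serialize + send) , #(M) * (recv + deserialize) )
  shared window: #(M) * (serialize + read + deserialize)

A ≥1.5x win for the shared window already at N = 2 is consistent with the send term dominating the other costs.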

vasdommes added a commit that referenced this issue Dec 28, 2023
Fix #164 Excessive IO and RAM usage when reading big SDP blocks
bharathr98 pushed a commit to bharathr98/sdpb that referenced this issue Mar 1, 2024
Now only the first rank from each group reads block_data and sends it to other ranks via MPI shared memory window,
see read_block_data() + copy_matrix_from_root()

+ File structure and refactoring for SDP/read_block_data
+ Added timers to SDP constructor and read_block_data()