Fix #207 bigint-syrk-blas: account for MPI shared memory limits (by splitting shared windows) #209
Support optional suffixes:
- 100 or 100B -> 100 bytes
- 100K or 100KB -> 102400 bytes
- 100M or 100MB -> 104857600 bytes
- 100G or 100GB -> 107374182400 bytes
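A minimal sketch of such a parser, assuming the base-1024 multipliers listed above (the function name and error handling are illustrative, not SDPB's actual implementation):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical parser for byte sizes with optional suffixes
// (B, K/KB, M/MB, G/GB), using base-1024 multipliers.
uint64_t parse_bytes(const std::string &input)
{
  size_t pos = 0;
  uint64_t value = std::stoull(input, &pos);
  std::string suffix = input.substr(pos);
  if(!suffix.empty() && suffix.back() == 'B')
    suffix.pop_back(); // accept both "K" and "KB", etc.
  if(suffix.empty())
    return value;
  if(suffix.size() != 1)
    throw std::invalid_argument("unknown suffix: " + input.substr(pos));
  switch(suffix.front())
    {
    case 'K': return value << 10; // 100K -> 102400
    case 'M': return value << 20; // 100M -> 104857600
    case 'G': return value << 30; // 100G -> 107374182400
    default: throw std::invalid_argument("unknown suffix: " + suffix);
    }
}
```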
Previously, we defined syrk by checking I==J. This does not work when we are multiplying different matrices, C_IJ := A_I^T B_J (this will happen when we split the Q window and multiply different vertical bands of P).
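In BLAS terms, the distinction looks roughly like this (a sketch using plain CBLAS doubles rather than SDPB's bigint residue arithmetic; `multiply_block` and its signature are illustrative):

```cpp
#include <cblas.h>

// Computing one block Q_IJ = P_I^T * P_J (column-major storage).
// When I == J the result is symmetric, so dsyrk suffices; for
// off-diagonal blocks (different vertical bands of P) we need dgemm.
void multiply_block(const double *P_I, const double *P_J, double *Q_IJ,
                    int rows, int cols_I, int cols_J, bool diagonal)
{
  if(diagonal)
    {
      // Q_II = P_I^T * P_I; only the upper triangle is written.
      cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans, cols_I, rows, 1.0,
                  P_I, rows, 0.0, Q_IJ, cols_I);
    }
  else
    {
      // Q_IJ = P_I^T * P_J; the full rectangular block is written.
      cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans, cols_I, cols_J,
                  rows, 1.0, P_I, rows, P_J, rows, 0.0, Q_IJ, cols_I);
    }
}
```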
… blocks for each MPI group, refactor compute_block_residues(). Fixes #203. bigint-syrk-blas: add --maxSharedMemory option to limit MPI shared memory window sizes. TODO: currently it fails if the limit is too small. We should split the input and output windows instead.
…ory limit. In unit tests, we test two cases:

- no memory limit (no P window splitting)
- a memory limit ensuring that only 3 rows fit into the P window (see calculate_matrix_square.test.cxx)

In end-to-end.test.cxx, we set --maxSharedMemory=1M for two realistic cases, thus enforcing split factors 4 and 6. In the other cases, the limit is not set. TODO: update Readme.md. TODO: also split the Q (output) window, if necessary.
…culating total_size. The result is different when input_window_split_factor > 1.
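A hypothetical illustration of how a split factor could be derived from --maxSharedMemory: take the smallest factor such that one vertical band of the P window, plus the Q window, fits under the limit. The exact rule SDPB uses may differ; this only shows why total_size depends on input_window_split_factor.

```cpp
#include <cstdint>

// Sketch: smallest input_window_split_factor satisfying the limit.
// Splitting P into `factor` vertical bands shrinks the P window
// proportionally; the Q window is left intact in this sketch.
uint64_t choose_input_window_split_factor(uint64_t p_window_bytes,
                                          uint64_t q_window_bytes,
                                          uint64_t max_shared_memory)
{
  uint64_t factor = 1;
  while(factor < p_window_bytes
        && p_window_bytes / factor + q_window_bytes > max_shared_memory)
    ++factor;
  return factor; // total_size ~ p_window_bytes / factor + q_window_bytes
}
```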
…e_split_factor to remove ambiguity
…e --maxSharedMemory limit. TODO: also update bigint_syrk/Readme.md.

Changed two end-to-end tests: set a low --maxSharedMemory to enforce Q window splitting. In unit tests, we set different shared memory limits, to calculate Q = P^T P without splitting, with splitting only the P window, or with splitting both P and Q. Also supported both uplo=UPPER and uplo=LOWER for syrk.

Fixed reduce_scatter(): the old version always synchronized only the upper half, but for off-diagonal blocks Q_IJ we need to synchronize everything.
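A simplified illustration of the fixed logic (the types and the element-enumeration approach are mine; SDPB's real reduce_scatter() operates on MPI shared memory windows):

```cpp
#include <utility>
#include <vector>

enum class UpperOrLower { UPPER, LOWER };

// Collect the (row, col) positions that must be synchronized across nodes.
// For a diagonal block Q_II, only one triangle (chosen by uplo) carries
// data; for an off-diagonal block Q_IJ, produced when the Q window is
// split, every element matters.
std::vector<std::pair<int, int>>
elements_to_synchronize(int height, int width, bool is_diagonal_block,
                        UpperOrLower uplo)
{
  std::vector<std::pair<int, int>> elements;
  for(int col = 0; col < width; ++col)
    {
      int row_begin = 0, row_end = height;
      if(is_diagonal_block)
        {
          if(uplo == UpperOrLower::UPPER)
            row_end = col + 1; // upper triangle: rows 0..col
          else
            row_begin = col; // lower triangle: rows col..height-1
        }
      for(int row = row_begin; row < row_end; ++row)
        elements.emplace_back(row, col);
    }
  return elements;
}
```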
Benchmarks for 4 nodes, 128 cores per node, 256GB RAM. SDP sizes:
Benchmarks for the timing run (2nd solver iteration) for different memory limits:
For nmax=18, SDP sizes:
What I don't understand very well is the extra memory usage associated with large windows. This means that the difference between the two cases is ~900GB of shared memory (as expected), plus ~500GB of something else. Is it due to BLAS jobs? P.S. For nmax=14, comparing the first and the last column, we see a ~50GB/node difference for shared memory plus ~10GB/node of something else.
Fixes:

Added the `--maxSharedMemory` option, e.g. `--maxSharedMemory=128G`. By default it is set to infinity. If the total size of the P and Q windows exceeds the limit, then we split the windows: first the input (P) window, and then, if necessary, the output (Q) window as well.
New matrix multiplication algorithm with splitting:
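A toy illustration of the splitting (plain doubles instead of SDPB's bigint residues, no MPI windows or BLAS calls; the function and the even-split assumption are illustrative only):

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Split the columns of P into `split_factor` vertical bands and assemble
// Q = P^T P block by block, Q_IJ = P_I^T P_J, filling only the upper
// triangle (uplo = UPPER).
Matrix syrk_with_split(const Matrix &P, int split_factor)
{
  int height = P.size(), width = P[0].size();
  assert(width % split_factor == 0); // assume an even split for simplicity
  int band = width / split_factor;
  Matrix Q(width, std::vector<double>(width, 0.0));
  for(int I = 0; I < split_factor; ++I)
    for(int J = I; J < split_factor; ++J) // blocks on/above the diagonal
      for(int i = 0; i < band; ++i)
        for(int j = 0; j < band; ++j)
          {
            int row = I * band + i, col = J * band + j;
            if(row > col)
              continue; // diagonal block I == J: keep the upper triangle
            for(int k = 0; k < height; ++k)
              Q[row][col] += P[k][row] * P[k][col];
          }
  return Q;
}
```

On the diagonal blocks (I == J) this reduces to a symmetric rank-k update (syrk); off-diagonal blocks require a general gemm, which is why the I==J check alone no longer defines syrk once the Q window is split.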
Testing:
If you run SDPB with `--verbosity debug`, you'll get information about window sizes, e.g.: