
Fix #207 bigint-syrk-blas: account for MPI shared memory limits (by splitting shared windows) #209

Merged
11 commits merged from window-split into bigint-syrk-blas on Mar 9, 2024

Conversation

@vasdommes (Collaborator) commented Mar 8, 2024

Fixes #207:

Added the --maxSharedMemory option, e.g. --maxSharedMemory=128G. By default, it is set to infinity.
If the total size of P and Q windows exceeds the limit, then we split the windows as follows:

  1. Choose the minimal split factor for Q so that it fits into the memory limit (leaving some room for P, namely one row per MPI group).
  2. Choose the minimal split factor for P so that it fits into the remaining memory.
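
A rough sketch of this split-factor selection (hypothetical names and deliberately simplified memory accounting; this is not the actual SDPB code):

```cpp
#include <cstddef>
#include <stdexcept>

struct Split_Factors
{
  size_t q; // Q window is split into q x q blocks
  size_t p; // each MPI group streams its rows through the P window in p steps
};

inline Split_Factors choose_split_factors(
  size_t max_shared_memory_bytes, // --maxSharedMemory
  size_t q_width,                 // width of Q (= width of P)
  size_t p_height,                // total height of P blocks on the node
  size_t num_mpi_groups,          // at least one P row per group must fit
  size_t bytes_per_p_row,         // size of one residue row of the P window
  size_t bytes_per_q_element)     // size of one residue element of Q
{
  // 1) the minimal Q split factor q such that one Q block plus one P row
  //    per MPI group fits into the limit
  for(size_t q = 1; q <= q_width; ++q)
    {
      const size_t q_block_width = (q_width + q - 1) / q; // ceil(q_width / q)
      const size_t q_bytes = q_block_width * q_block_width * bytes_per_q_element;
      const size_t min_p_bytes = num_mpi_groups * bytes_per_p_row;
      if(q_bytes + min_p_bytes > max_shared_memory_bytes)
        continue;
      // 2) the minimal P split factor p such that p_height / p rows
      //    fit into the remaining memory
      const size_t rows_that_fit
        = (max_shared_memory_bytes - q_bytes) / bytes_per_p_row;
      const size_t p = (p_height + rows_that_fit - 1) / rows_that_fit; // ceil
      return {q, p};
    }
  throw std::runtime_error("--maxSharedMemory is too small");
}
```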

New matrix multiplication algorithm with splitting:

1. Choose a set of primes.
2. Calculate column norms of P.
3. Normalize P and multiply by 2^N.
4. For each i,j=1..M:
   4.1 Fill the Q window with zeros.
   4.2 While not all rows of P are processed:
     4.2.1 Each MPI group takes the top (H_g/s) remaining rows from its blocks and writes their residues (for the corresponding columns) to the P window. Here H_g is the total number of block rows for group g, and s is the split factor for the P window.
     4.2.2 Call BLAS jobs to update the Q window.
   4.3 Compute Q_group_ij from the residues stored in the Q window.
   4.4 Reduce-scatter Q_group_ij from all nodes to the global Q_ij.
5. Restore Q (divide by 2^2N and remove normalization).
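
An illustrative outline of this loop (the steps are shown as comments; names are hypothetical, not the actual SDPB API):

```cpp
#include <cstddef>

void bigint_syrk_with_split_windows(size_t q_split_factor)
{
  // Steps 1-3: choose primes, compute column norms of P,
  // normalize P and multiply by 2^N (omitted here).

  const size_t M = q_split_factor;
  for(size_t i = 0; i < M; ++i)
    for(size_t j = i; j < M; ++j) // assuming only one triangle of Q blocks is
                                  // needed, since Q is symmetric
      {
        // 4.1: fill the shared Q window for block (i, j) with zeros.

        bool all_rows_processed = false;
        while(!all_rows_processed) // 4.2: stream P through the smaller P window
          {
            // 4.2.1: each MPI group writes residues of its next H_g/s rows
            //        (columns of bands i and j) into the P window.
            // 4.2.2: BLAS jobs update the Q window:
            //        Q_window += P_window_i^T * P_window_j
            //        (syrk when both operands coincide, gemm otherwise).
            all_rows_processed = true; // placeholder so the sketch terminates
          }

        // 4.3: compute Q_group_ij from the residues stored in the Q window.
        // 4.4: reduce-scatter Q_group_ij from all nodes into the global Q_ij.
      }

  // Step 5: restore Q (divide by 2^2N and remove the normalization).
}
```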

Testing:

  • Changed two end-to-end tests: set low --maxSharedMemory to enforce Q window splitting.
  • In unit tests, we set different shared memory limits to calculate Q = P^T P without splitting, with splitting only the P window, or with splitting both P and Q windows.
  • Checked the results for realistic multi-node computations (fermions-3d, stress-tensors-3d) on Expanse.

If you run SDPB with --verbosity debug, you'll get information about window sizes, e.g.:

0 Allocate Shared_Window_Array, elements: 3127925220, size, GB: 23.3049
0 Allocate Shared_Window_Array, elements: 227516730, size, GB: 1.69513
create BigInt_Shared_Memory_Syrk_Context, rank=0
  Shared memory limit, bytes: 26843545600
  Number of primes: 105
  Blocks on the node:
    Total height: 45909
    Width: 5458
    Elements: 250571322
    Heights for each MPI group:
2190, 2190, 2190, 2190, 2190, 2190, 2190, 2130, 1920, 1800, 1680, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 420, 414, 408, 402, 396, 390, 384, 381, 381, 378, 378, 375, 375, 372, 372, 369, 369, 366, 366, 363, 363, 360, 360, 357, 357, 354, 354, 351
  Output residues window (Q):
    Window size, bytes: 25023401760
    Split factor: 1
    Height=Width (per prime): 5458
    Total elements (per prime): 29789764
  Input residues window (P):
  Number of windows: 1
    Window size, bytes: 1820133840
    Split factor: 126
    Height (per prime): 397
    Heights for each MPI group:
18, 18, 18, 18, 18, 18, 18, 17, 16, 15, 14, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
    MPI group sizes:
6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Warning: rank=0: BigInt_Shared_Memory_Syrk_Context: large input_window_split_factor=126 may affect performance. Consider increasing available shared memory per node.

Support optional suffixes:
100 or 100B -> 100 bytes
100K or 100KB -> 102400 bytes
100M or 100MB -> 104857600 bytes
100G or 100GB -> 107374182400 bytes
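
A minimal sketch of parsing such size strings, assuming the binary (1024-based) multiples listed above; this is not the actual SDPB parser:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

inline uint64_t parse_byte_size(const std::string &input)
{
  size_t pos = 0;
  const uint64_t value = std::stoull(input, &pos);
  std::string suffix = input.substr(pos);
  if(!suffix.empty() && suffix.back() == 'B')
    suffix.pop_back(); // "KB" -> "K", "B" -> ""
  if(suffix.empty())
    return value; // plain bytes
  switch(suffix[0])
    {
      case 'K': return value << 10; // * 1024
      case 'M': return value << 20; // * 1024^2
      case 'G': return value << 30; // * 1024^3
      default: throw std::runtime_error("Unknown size suffix: " + input);
    }
}
// e.g. parse_byte_size("100K") == 102400,
//      parse_byte_size("100GB") == 107374182400
```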
Previously, we defined syrk by checking I==J.
This does not work when we are multiplying different matrices, C_IJ := A_I^T B_J
(this will happen when we split the Q window and multiply different vertical bands of P).
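
A hypothetical illustration of that check (names are made up for clarity, not taken from the SDPB code):

```cpp
#include <cstddef>

// With a split Q window, a BLAS job computes C_IJ = A_I^T * B_J, where A and B
// may be different vertical bands of P, so "I == J" alone no longer implies syrk.
bool is_syrk_job(size_t I, size_t J, size_t band_of_A, size_t band_of_B)
{
  // syrk only when the output block is diagonal and both operands come
  // from the same band of P; otherwise it is a general gemm.
  return I == J && band_of_A == band_of_B;
}
```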
… blocks for each MPI group, refactor compute_block_residues()

Fixes #203 bigint-syrk-blas: add --maxSharedMemory option to limit MPI shared memory window sizes
TODO: currently it fails if the limit is too small. We should split input and output windows instead.
…ory limit

In unit tests, we test two cases:
- no memory limit (no P window splitting)
- a memory limit ensuring that only 3 rows fit into the P window.
See calculate_matrix_square.test.cxx.

In end-to-end.test.cxx, we set --maxSharedMemory=1M for two realistic cases, thus enforcing split factors 4 and 6.
In other cases, the limit is not set.

TODO: update Readme.md
TODO: split also Q (output) window, if necessary.
…culating total_size

Result is different when input_window_split_factor > 1.
…e --maxSharedMemory limit

TODO: also update bigint_syrk/Readme.md

Changed two end-to-end tests: set low --maxSharedMemory to enforce Q window splitting
In unit tests, we set different shared memory limits to calculate Q = P^T P without splitting, with splitting only the P window, or with splitting both P and Q windows.

Also supported both uplo=UPPER and uplo=LOWER for syrk.
Fixed reduce_scatter(): the old version always synchronized only the upper half, but for off-diagonal blocks Q_IJ we need to synchronize all elements.
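
A hypothetical sketch of which elements of a Q block have to be synchronized (not the actual reduce_scatter() code):

```cpp
#include <cstddef>

// Diagonal blocks Q_II are symmetric, so one triangle (selected by uplo)
// is enough; off-diagonal blocks Q_IJ need every element.
bool element_needs_sync(size_t I, size_t J, size_t row, size_t col, bool upper)
{
  if(I != J)
    return true; // off-diagonal block: synchronize the whole block
  return upper ? col >= row : col <= row; // diagonal block: one triangle
}
```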
@vasdommes vasdommes added this to the 2.8.0 milestone Mar 8, 2024
@vasdommes vasdommes self-assigned this Mar 8, 2024
@vasdommes (Collaborator, Author) commented Mar 8, 2024

Benchmarks for nmax=14 (from the stress-tensors-3d project) show that splitting the P window reduces memory usage without affecting performance. Performance degrades only if we split Q into many parts, probably because the residues of each P element have to be calculated multiple times.
(Note that setting the limit much smaller than the Q size makes sense only for extremely large jobs, where Q barely fits into a single machine.)

4 nodes, 128 cores per node, 256GB RAM

SDP sizes:

primal dimension: 90040
dual dimension: 3406
SDP blocks: 225
Total size of input P windows (on all nodes), without splitting: 176 GB (~44GB per node)
Output Q window size: 6.7GB

Benchmarks for timing run (2nd solver iteration) for different memory limits:

| --maxSharedMemory | 0GB (unlimited) | 20GB | 6GB | 100MB |
|---|---|---|---|---|
| split factor for P | 1 | 4 | 12 | 354 |
| split factor for Q | 1 | 1 | 2 | 10 |
| RAM usage per node | 143GB | 94GB | 81.8GB | 79GB |
| Total iteration time | 309s | 311s | 310s | 429s |
| compute Q on node | 149s | 149s | 153s | 272s |
| reduce-scatter Q | 107s | 108s | 104s | 103s |
| other steps | 53s | 54s | 53s | 53s |

@vasdommes (Collaborator, Author) commented

For nmax=18, setting --maxSharedMemory=25GB (just above the size of the Q window) allowed us to reduce the number of nodes from 10 to 4. Interestingly, the total completion time remained almost the same.
Increasing the number of nodes n makes all local computations faster (~1/n in the perfect-scaling regime), but the reduce-scatter time grows linearly, ~n.
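
A rough cost model consistent with this observation (a sketch; $T_{\text{local}}$ and $T_{\text{rs}}$ are assumed constants, not measured values):

$$T(n) \approx \frac{T_{\text{local}}}{n} + T_{\text{rs}}\,n$$

so decreasing n trades a larger local-computation term for a smaller reduce-scatter term, and the total can stay roughly flat.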

SDP sizes:

primal dimension: 187328
dual dimension: 5458
SDP blocks: 369
Total size of input P windows (on all nodes), without splitting: 777 GB (~78GB per node)
Output Q window size: 22.6GB


@vasdommes (Collaborator, Author) commented Mar 9, 2024

What I don't understand very well is the extra memory usage associated with large windows.
For example, for nmax=18 without window splitting, total RAM usage across 10 nodes is ~2400GB (shared windows take ~1000GB in total).
Since one node has only 256GB RAM, we would need at least 6 nodes even if the shared window size were close to zero.
However, with --maxSharedMemory=25G we can fit into 4 nodes, i.e. ~1000GB (shared windows take ~100GB in total).

This means that the difference between two cases is ~900GB of shared memory (as expected), plus ~500GB of something else. Is it due to BLAS jobs?

P.S. For nmax=14, comparing the first and the last columns, we see a ~50GB/node difference in shared memory plus ~10GB/node of something else.

@vasdommes vasdommes merged commit 7b52f8b into bigint-syrk-blas Mar 9, 2024
2 checks passed
@vasdommes vasdommes deleted the window-split branch March 9, 2024 02:07