
Fix #207 bigint-syrk-blas: account for MPI shared memory limits (by splitting shared windows) #209

Merged
11 commits merged from window-split into bigint-syrk-blas on Mar 9, 2024

Conversation

@vasdommes (Collaborator) commented Mar 8, 2024

Fixes #207:

Added the --maxSharedMemory option, e.g. --maxSharedMemory=128G. By default, it is set to infinity.
If the total size of P and Q windows exceeds the limit, then we split the windows as follows:

  1. Choose the minimal split factor for Q so that it fits into the memory limit (leaving some room for P, namely one row per MPI group).
  2. Choose the minimal split factor for P so that it fits into the remaining memory.
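
A rough sketch of this split-factor selection (hypothetical names and deliberately simplified memory accounting; this is not the actual SDPB code):

```cpp
#include <cstddef>
#include <stdexcept>

struct Split_Factors
{
  size_t q; // Q window is split into q x q blocks
  size_t p; // each MPI group streams its rows through the P window in p steps
};

inline Split_Factors choose_split_factors(
  size_t max_shared_memory_bytes, // --maxSharedMemory
  size_t q_width,                 // width of Q (= width of P)
  size_t p_height,                // total height of P blocks on the node
  size_t num_mpi_groups,          // at least one P row per group must fit
  size_t bytes_per_p_row,         // size of one residue row of the P window
  size_t bytes_per_q_element)     // size of one residue element of Q
{
  // 1) the minimal Q split factor q such that one Q block plus one P row
  //    per MPI group fits into the limit
  for(size_t q = 1; q <= q_width; ++q)
    {
      const size_t q_block_width = (q_width + q - 1) / q; // ceil(q_width / q)
      const size_t q_bytes = q_block_width * q_block_width * bytes_per_q_element;
      const size_t min_p_bytes = num_mpi_groups * bytes_per_p_row;
      if(q_bytes + min_p_bytes > max_shared_memory_bytes)
        continue;
      // 2) the minimal P split factor p such that p_height / p rows
      //    fit into the remaining memory
      const size_t rows_that_fit
        = (max_shared_memory_bytes - q_bytes) / bytes_per_p_row;
      const size_t p = (p_height + rows_that_fit - 1) / rows_that_fit; // ceil
      return {q, p};
    }
  throw std::runtime_error("--maxSharedMemory is too small");
}
```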

New matrix multiplication algorithm with splitting:

1. Choose a set of primes.
2. Calculate column norms of P.
3. Normalize P and multiply by 2^N.
4. For each i,j=1..M:
   4.1 Fill the Q window with zeros.
   4.2 While not all rows of P are processed:
     4.2.1 Each MPI group takes the top (H_g/s) remaining rows from its blocks and writes their residues (for the corresponding columns) to the P window. Here H_g is the total number of block rows for group g, and s is the split factor for the P window.
     4.2.2 Call BLAS jobs to update the Q window.
   4.3 Compute Q_group_ij from the residues stored in the Q window.
   4.4 Reduce-scatter Q_group_ij from all nodes to the global Q_ij.
5. Restore Q (divide by 2^2N and remove normalization).
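
An illustrative outline of this loop (the steps are shown as comments; names are hypothetical, not the actual SDPB API):

```cpp
#include <cstddef>

void bigint_syrk_with_split_windows(size_t q_split_factor)
{
  // Steps 1-3: choose primes, compute column norms of P,
  // normalize P and multiply by 2^N (omitted here).

  const size_t M = q_split_factor;
  for(size_t i = 0; i < M; ++i)
    for(size_t j = i; j < M; ++j) // assuming only one triangle of Q blocks is
                                  // needed, since Q is symmetric
      {
        // 4.1: fill the shared Q window for block (i, j) with zeros.

        bool all_rows_processed = false;
        while(!all_rows_processed) // 4.2: stream P through the smaller P window
          {
            // 4.2.1: each MPI group writes residues of its next H_g/s rows
            //        (columns of bands i and j) into the P window.
            // 4.2.2: BLAS jobs update the Q window:
            //        Q_window += P_window_i^T * P_window_j
            //        (syrk when both operands coincide, gemm otherwise).
            all_rows_processed = true; // placeholder so the sketch terminates
          }

        // 4.3: compute Q_group_ij from the residues stored in the Q window.
        // 4.4: reduce-scatter Q_group_ij from all nodes into the global Q_ij.
      }

  // Step 5: restore Q (divide by 2^2N and remove the normalization).
}
```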

Testing:

  • Changed two end-to-end tests: set low --maxSharedMemory to enforce Q window splitting.
  • In unit tests, we set different shared memory limits to calculate Q = P^T P without splitting, with splitting only the P window, or with splitting both P and Q windows.
  • Checked the results for realistic multi-node computations (fermions-3d, stress-tensors-3d) on Expanse.

If you run SDPB with --verbosity debug, you'll get information about window sizes, e.g.:

0 Allocate Shared_Window_Array, elements: 3127925220, size, GB: 23.3049
0 Allocate Shared_Window_Array, elements: 227516730, size, GB: 1.69513
create BigInt_Shared_Memory_Syrk_Context, rank=0
  Shared memory limit, bytes: 26843545600
  Number of primes: 105
  Blocks on the node:
    Total height: 45909
    Width: 5458
    Elements: 250571322
    Heights for each MPI group:
2190, 2190, 2190, 2190, 2190, 2190, 2190, 2130, 1920, 1800, 1680, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 438, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 426, 420, 414, 408, 402, 396, 390, 384, 381, 381, 378, 378, 375, 375, 372, 372, 369, 369, 366, 366, 363, 363, 360, 360, 357, 357, 354, 354, 351
  Output residues window (Q):
    Window size, bytes: 25023401760
    Split factor: 1
    Height=Width (per prime): 5458
    Total elements (per prime): 29789764
  Input residues window (P):
  Number of windows: 1
    Window size, bytes: 1820133840
    Split factor: 126
    Height (per prime): 397
    Heights for each MPI group:
18, 18, 18, 18, 18, 18, 18, 17, 16, 15, 14, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
    MPI group sizes:
6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Warning: rank=0: BigInt_Shared_Memory_Syrk_Context: large input_window_split_factor=126 may affect performance. Consider increasing available shared memory per node.

Support optional suffixes:
100 or 100B -> 100 bytes
100K or 100KB -> 102400 bytes
100M or 100MB -> 104857600 bytes
100G or 100GB -> 107374182400 bytes
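
A minimal sketch of parsing such size strings, assuming the binary (1024-based) multiples listed above; this is not the actual SDPB parser:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

inline uint64_t parse_byte_size(const std::string &input)
{
  size_t pos = 0;
  const uint64_t value = std::stoull(input, &pos);
  std::string suffix = input.substr(pos);
  if(!suffix.empty() && suffix.back() == 'B')
    suffix.pop_back(); // "KB" -> "K", "B" -> ""
  if(suffix.empty())
    return value; // plain bytes
  switch(suffix[0])
    {
      case 'K': return value << 10; // * 1024
      case 'M': return value << 20; // * 1024^2
      case 'G': return value << 30; // * 1024^3
      default: throw std::runtime_error("Unknown size suffix: " + input);
    }
}
// e.g. parse_byte_size("100K") == 102400,
//      parse_byte_size("100GB") == 107374182400
```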
Previously, we defined syrk by checking I==J.
This does not work when we are multiplying different matrices, C_IJ := A_I^T B_J
(this will happen when we split the Q window and multiply different vertical bands of P).
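
A hypothetical illustration of that check (names are made up for clarity, not taken from the SDPB code):

```cpp
#include <cstddef>

// With a split Q window, a BLAS job computes C_IJ = A_I^T * B_J, where A and B
// may be different vertical bands of P, so "I == J" alone no longer implies syrk.
bool is_syrk_job(size_t I, size_t J, size_t band_of_A, size_t band_of_B)
{
  // syrk only when the output block is diagonal and both operands come
  // from the same band of P; otherwise it is a general gemm.
  return I == J && band_of_A == band_of_B;
}
```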
… blocks for each MPI group, refactor compute_block_residues()

Fixes #203 bigint-syrk-blas: add --maxSharedMemory option to limit MPI shared memory window sizes
TODO: currently it fails if the limit is too small. We should split input and output windows instead.
…ory limit

In unit tests, we test two cases:
- no memory limit (no P window splitting)
- a memory limit ensuring that only 3 rows fit into the P window.
See calculate_matrix_square.test.cxx.

In end-to-end.test.cxx, we set --maxSharedMemory=1M for two realistic cases, thus enforcing split factors 4 and 6.
In other cases, the limit is not set.

TODO: update Readme.md
TODO: split also Q (output) window, if necessary.
…culating total_size

Result is different when input_window_split_factor > 1.
…e --maxSharedMemory limit

TODO: also update bigint_syrk/Readme.md

Changed two end-to-end tests: set low --maxSharedMemory to enforce Q window splitting
In unit tests, we set different shared memory limits to calculate Q = P^T P without splitting, with splitting only the P window, or with splitting both P and Q windows.

Also supported both uplo=UPPER and uplo=LOWER for syrk.
Fixed reduce_scatter(): the old version always synchronized only the upper half, but for off-diagonal blocks Q_IJ we need to synchronize all elements.
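
A hypothetical sketch of which elements of a Q block have to be synchronized (not the actual reduce_scatter() code):

```cpp
#include <cstddef>

// Diagonal blocks Q_II are symmetric, so one triangle (selected by uplo)
// is enough; off-diagonal blocks Q_IJ need every element.
bool element_needs_sync(size_t I, size_t J, size_t row, size_t col, bool upper)
{
  if(I != J)
    return true; // off-diagonal block: synchronize the whole block
  return upper ? col >= row : col <= row; // diagonal block: one triangle
}
```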
@vasdommes vasdommes added this to the 2.8.0 milestone Mar 8, 2024
@vasdommes vasdommes self-assigned this Mar 8, 2024
@vasdommes (Collaborator, Author) commented Mar 8, 2024

Benchmarks for nmax=14 (from the stress-tensors-3d project) show that splitting the P window reduces memory usage without affecting performance. Performance degrades only if we split Q into many parts, probably because the residues of each P element have to be calculated multiple times.
(Note that setting the limit much smaller than the Q size makes sense only for extremely large jobs, where Q barely fits into a single machine.)

4 nodes, 128 cores per node, 256GB RAM

SDP sizes:

primal dimension: 90040
dual dimension: 3406
SDP blocks: 225
Total size of input P windows (on all nodes), without splitting: 176 GB (~44GB per node)
Output Q window size: 6.7GB

Benchmarks for timing run (2nd solver iteration) for different memory limits:

| --maxSharedMemory | 0GB (unlimited) | 20GB | 6GB | 100MB |
|---|---|---|---|---|
| split factor for P | 1 | 4 | 12 | 354 |
| split factor for Q | 1 | 1 | 2 | 10 |
| RAM usage per node | 143GB | 94GB | 81.8GB | 79GB |
| Total iteration time | 309s | 311s | 310s | 429s |
| compute Q on node | 149s | 149s | 153s | 272s |
| reduce-scatter Q | 107s | 108s | 104s | 103s |
| other steps | 53s | 54s | 53s | 53s |

@vasdommes (Collaborator, Author) commented

For nmax=18, setting --maxSharedMemory=25GB (just above the size of the Q window) allowed us to reduce the number of nodes from 10 to 4. Interestingly, the total completion time remained almost the same.
Increasing the number of nodes n makes all local computations faster (~1/n in the perfect-scaling regime), but the reduce-scatter time grows linearly, ~n.
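
A rough cost model consistent with this observation (a sketch; $T_{\text{local}}$ and $T_{\text{rs}}$ are assumed constants, not measured values):

$$T(n) \approx \frac{T_{\text{local}}}{n} + T_{\text{rs}}\,n$$

so decreasing n trades a larger local-computation term for a smaller reduce-scatter term, and the total can stay roughly flat.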

SDP sizes:

primal dimension: 187328
dual dimension: 5458
SDP blocks: 369
Total size of input P windows (on all nodes), without splitting: 777 GB (~78GB per node)
Output Q window size: 22.6GB


@vasdommes (Collaborator, Author) commented Mar 9, 2024

What I don't understand very well is the extra memory usage associated with large windows.
For example, for nmax=18 without window splitting, total RAM usage across 10 nodes is ~2400GB (shared windows take ~1000GB in total).
Since one node has only 256GB RAM, we would need at least 6 nodes even if the shared window size were close to zero.
However, with --maxSharedMemory=25G we can fit into 4 nodes, i.e. ~1000GB (shared windows take ~100GB in total).

This means that the difference between two cases is ~900GB of shared memory (as expected), plus ~500GB of something else. Is it due to BLAS jobs?

P.S. For nmax=14, comparing the first and the last columns, we see a ~50GB/node difference in shared memory plus ~10GB/node of something else.

@vasdommes vasdommes merged commit 7b52f8b into bigint-syrk-blas Mar 9, 2024
2 checks passed
@vasdommes vasdommes deleted the window-split branch March 9, 2024 02:07