
Incorrect block mapping for cyclic MPI job distribution across nodes #166

Closed
vasdommes opened this issue Dec 28, 2023 · 0 comments

@vasdommes (Collaborator)

When assigning blocks, SDPB assumes that the first procsPerNode MPI ranks belong to the first node.
In SLURM, this corresponds to the --distribution=block option for srun; see https://slurm.schedmd.com/srun.html#OPT_distribution

For other distributions this assumption does not hold. For example, if the --distribution=cyclic option is set instead, the first rank goes to the first node, the second rank to the second node, and so on.

In that case, if a block is shared between rank 0 and rank 1, its data is spread across different nodes.
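
To make the difference concrete, here is a minimal standalone sketch (not SDPB code; the node/rank counts and variable names are made up for illustration) comparing the node index SDPB assumes for each rank with the node index a cyclic distribution actually produces:

```cpp
#include <cstdio>

int main()
{
  // Illustrative numbers only: 2 nodes with 4 MPI ranks each.
  const int numNodes = 2, procsPerNode = 4;

  for(int rank = 0; rank < numNodes * procsPerNode; ++rank)
    {
      // What SDPB assumes (SLURM --distribution=block):
      // consecutive ranks fill one node before moving to the next.
      const int assumedNode = rank / procsPerNode;
      // What --distribution=cyclic actually does:
      // ranks are dealt out to nodes round-robin.
      const int actualNode = rank % numNodes;
      std::printf("rank %d: assumed node %d, actual node %d\n", rank,
                  assumedNode, actualNode);
    }
  // Already for ranks 0 and 1 the two layouts disagree, so a block shared
  // between them ends up split across two physical nodes.
  return 0;
}
```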

For SDPB 2.6.1, this can lead to slow DistMatrix operations (e.g. Cholesky decomposition) due to slow communication between nodes.
For future versions that use a shared memory window (e.g. after merging #142), this will lead to errors and/or failures.
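
For context, an MPI shared-memory window can only be allocated on a communicator whose ranks all reside on the same node (e.g. one obtained via MPI_Comm_split_type with MPI_COMM_TYPE_SHARED), which is why a block whose ranks span two nodes cannot keep its data in a single window. A rough sketch, not taken from SDPB:

```cpp
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  // Communicator containing only the ranks that share this node's memory.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);

  // A shared-memory window is valid only within node_comm; ranks on other
  // nodes cannot attach to it. Hence a block spread over two nodes cannot
  // place its data in one shared window.
  double *base = nullptr;
  MPI_Win window;
  MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                          node_comm, &base, &window);

  MPI_Win_free(&window);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```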

@vasdommes vasdommes added the bug label Dec 28, 2023
@vasdommes vasdommes added this to the 2.7.0 milestone Dec 28, 2023
@vasdommes vasdommes self-assigned this Dec 28, 2023
vasdommes added a commit that referenced this issue Jan 3, 2024
…ss nodes

Use ranks from node communicator instead of global ranks (which can be assigned to nodes in different ways).
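
A rough sketch of that approach (plain MPI calls for illustration, not SDPB's actual Elemental-based code): obtain a per-node communicator and use each process's rank within it, which is consistent regardless of how SLURM placed the global ranks.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Split MPI_COMM_WORLD into one communicator per node. MPI itself knows
  // which ranks share a node, so this works for block, cyclic, or any other
  // SLURM distribution.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);

  int node_rank, node_size;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);

  // Grouping blocks by node_rank (instead of world_rank / procsPerNode)
  // guarantees that ranks sharing a block also share a node.
  std::printf("world rank %d -> node-local rank %d of %d\n", world_rank,
              node_rank, node_size);

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```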
@vasdommes vasdommes modified the milestones: 2.7.0, 2.6.2 Jan 4, 2024
vasdommes added a commit that referenced this issue Jan 5, 2024
Fix #166 Incorrect block mapping for cyclic MPI job distribution across nodes
bharathr98 pushed a commit to bharathr98/sdpb that referenced this issue Mar 1, 2024
…on across nodes

Use ranks from node communicator instead of global ranks (which can be assigned to nodes in different ways).