MPI Parallelization #138

Merged
merged 138 commits into from
Aug 24, 2022

Conversation

FG-TUM
Collaborator

@FG-TUM FG-TUM commented May 24, 2022

Description

Adds support for MPI Parallelization. Decomposition is a simple regular grid decomposition.
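As a rough illustration of what a regular grid decomposition means here, the following sketch maps a particle position to the rank that owns it. All names (`RegularGrid`, `rankOfPosition`) and the row-major rank numbering are assumptions made for this example, not the code in this PR.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical description of a regular grid decomposition (not the LADDS API).
struct RegularGrid {
  std::array<double, 3> boxMin;   // lower corner of the global domain
  std::array<double, 3> boxMax;   // upper corner of the global domain
  std::array<int, 3> ranksPerDim; // number of subdomains along x, y, z
};

// Returns the rank that owns `pos`, assuming ranks are numbered row-major
// over the grid cells. Positions outside the box are clamped to the nearest
// boundary cell.
inline int rankOfPosition(const RegularGrid &grid, const std::array<double, 3> &pos) {
  std::array<int, 3> cell{};
  for (std::size_t d = 0; d < 3; ++d) {
    const double cellLength = (grid.boxMax[d] - grid.boxMin[d]) / grid.ranksPerDim[d];
    const double raw = std::floor((pos[d] - grid.boxMin[d]) / cellLength);
    // Clamp as double before converting, so far-away positions cannot overflow int.
    const double clamped = std::clamp(raw, 0.0, static_cast<double>(grid.ranksPerDim[d] - 1));
    cell[d] = static_cast<int>(clamped);
  }
  return (cell[0] * grid.ranksPerDim[1] + cell[1]) * grid.ranksPerDim[2] + cell[2];
}
```

With a 2 x 2 x 1 grid over a cube from -10 to 10, a particle at (5, -5, 0) would land in cell (1, 0, 0), which is rank 2 under this numbering.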

  • Update AutoPas version
  • Restrict CLI output to rank 0
  • Logger to visualize decomposition
  • Add initial particles to correct rank
  • new ID system
  • Particle communication
    • Particle serialization
    • Sending / receiving of migrating particles
    • Sending / receiving of halo particles (see comment)
    • Collision detection of migrants
  • VTK Output
  • Runtime Summary (comment)
  • HDF5 Output (Currently one file per rank)
  • Constellations
    • IDs
    • insertion into correct rank
    • Tests
  • Checkpoints
    • Tests
  • Tests
    • MPI parallel tests
    • update ValidationTest (VTUWriter changed)
  • Update README
  • Run experiments
    • small for validation
    • big for speed (with both decomps)

New ID System

Ideas:

  1. Split the value range for IDs into one section per rank. Each rank can then assign IDs as needed, and we would notice an ID collision when a rank exhausts its section. Such collisions are unlikely: even for 1 million ranks, each rank could assign 2^64 / 10^6 ≈ 1.8*10^13 IDs.
  2. Assign IDs incrementally as now and have global communication that coordinates the distribution of new IDs in the event of a collision or constellation insertion.
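Idea 1 could be sketched roughly as follows. `RankIdAllocator` and its methods are hypothetical names for this illustration. The sketch assumes 64-bit unsigned IDs and that every rank knows its own rank number and the total number of ranks.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical per-rank ID allocator: each rank owns one contiguous,
// equally sized block of the 64-bit ID space.
class RankIdAllocator {
 public:
  RankIdAllocator(std::uint64_t rank, std::uint64_t numRanks) {
    if (numRanks == 0 || rank >= numRanks) {
      throw std::invalid_argument("rank must be smaller than numRanks");
    }
    // Integer division, so (rank + 1) * capacity_ can never overflow.
    capacity_ = std::numeric_limits<std::uint64_t>::max() / numRanks;
    next_ = rank * capacity_;
    end_ = next_ + capacity_;
  }

  // Number of IDs this rank may hand out in total.
  std::uint64_t capacity() const { return capacity_; }

  // Returns a fresh ID, or throws once the block of this rank is used up.
  // The throw is the point where a would-be collision gets noticed.
  std::uint64_t nextId() {
    if (next_ == end_) {
      throw std::overflow_error("ID range of this rank is exhausted");
    }
    return next_++;
  }

 private:
  std::uint64_t capacity_{};
  std::uint64_t next_{};
  std::uint64_t end_{};
};
```

No communication is needed to hand out an ID in this scheme. The trade-off is that the blocks are fixed up front, so a rank that creates unusually many particles cannot borrow IDs from its neighbours.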

Optional TODOs

If these are not addressed in this PR create issues for them.

  • Change Decomposition logger from std::ofstream to spdlog::basic_logger_mt<spdlog::async_factory>.
  • New heuristic for CSF
  • Reevaluate tuning

Related Pull Requests

Resolved Issues

How Has This Been Tested?

  • TODO

@FG-TUM FG-TUM added the Enhancement New feature or request label May 24, 2022
@FG-TUM FG-TUM self-assigned this May 24, 2022
@FG-TUM
Collaborator Author

FG-TUM commented May 25, 2022

@gomezzz Regarding the runtime summary at the end: what do we want to show for each entry?

  • Sum of times of all ranks?
  • Min/Max
  • Average (arithmetic, geometric, median, ....)
  • ...?

In my opinion, the max of each value would give us the most insight?
On the other hand, looking at the sums is interesting because then the percentages add up correctly?
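To make the options concrete, here is a small sketch that computes sum, min, max, and arithmetic mean for one timer entry from its per-rank values. `TimerSummary` and `summarizeTimer` are hypothetical names. The input is a plain vector to keep the example self-contained; in an MPI run the same numbers could presumably be obtained with `MPI_Reduce` using `MPI_SUM`, `MPI_MIN`, and `MPI_MAX`.

```cpp
#include <algorithm>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical summary of one timer entry over all ranks.
struct TimerSummary {
  double sum;
  double min;
  double max;
  double mean;
};

// Computes the candidate statistics from one measured time per rank.
inline TimerSummary summarizeTimer(const std::vector<double> &secondsPerRank) {
  if (secondsPerRank.empty()) {
    throw std::invalid_argument("need at least one rank");
  }
  const auto [minIt, maxIt] = std::minmax_element(secondsPerRank.begin(), secondsPerRank.end());
  const double sum = std::accumulate(secondsPerRank.begin(), secondsPerRank.end(), 0.0);
  return {sum, *minIt, *maxIt, sum / static_cast<double>(secondsPerRank.size())};
}
```

For per-rank times of 1, 2, 3, and 6 seconds this yields a sum of 12, a min of 1, a max of 6, and a mean of 3. The gap between max and mean hints at load imbalance, while the sums are the values whose percentages add up to 100.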

@gomezzz
Collaborator

gomezzz commented Aug 8, 2022

@gomezzz

  • Try out constellations to see if it works or is broken

@gomezzz
Collaborator

gomezzz commented Aug 12, 2022

@FG-TUM As discussed, I tried out constellations by just uncommenting the respective line in the default config. This currently crashes on this branch with an uncaught std::length_error (all processes abort with signal 6):

EDIT: Also happens without MPI, so constellations are currently broken, I guess?

terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_replace_aux
[thinkpad:11910] *** Process received signal ***
[thinkpad:11910] Signal: Aborted (6)
[thinkpad:11910] Signal code:  (-6)
[thinkpad:11910] [ 0] /usr/lib/libc.so.6(+0x3e8e0)[0x7f4b712568e0]
[thinkpad:11910] [ 1] /usr/lib/libc.so.6(+0x8e36c)[0x7f4b712a636c]
[thinkpad:11910] [ 2] /usr/lib/libc.so.6(raise+0x18)[0x7f4b71256838]
[thinkpad:11910] [ 3] /usr/lib/libc.so.6(abort+0xcf)[0x7f4b71240535]
[thinkpad:11910] [ 4] /usr/lib/libstdc++.so.6(+0x99833)[0x7f4b715c4833]
[thinkpad:11910] [ 5] /usr/lib/libstdc++.so.6(+0xa5bfc)[0x7f4b715d0bfc]
[thinkpad:11910] [ 6] /usr/lib/libstdc++.so.6(+0xa5c69)[0x7f4b715d0c69]
[thinkpad:11910] [ 7] /usr/lib/libstdc++.so.6(+0xa5ecd)[0x7f4b715d0ecd]
[thinkpad:11910] [ 8] /usr/lib/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x7f4b715c74ce]
[thinkpad:11910] [ 9] /usr/lib/libstdc++.so.6(+0x14bf6c)[0x7f4b71676f6c]
[thinkpad:11910] [10] ./src/ladds/ladds(+0xb780d)[0x55ec2fb3380d]
[thinkpad:11910] [11] ./src/ladds/ladds(+0x142519)[0x55ec2fbbe519]
[thinkpad:11910] [12] ./src/ladds/ladds(+0x140094)[0x55ec2fbbc094]
[thinkpad:11910] [13] ./src/ladds/ladds(+0xc0549)[0x55ec2fb3c549]
[thinkpad:11910] [14] ./src/ladds/ladds(+0x76e01)[0x55ec2faf2e01]
[thinkpad:11910] [15] ./src/ladds/ladds(+0x13d598)[0x55ec2fbb9598]
[thinkpad:11910] [16] /usr/lib/libc.so.6(+0x29290)[0x7f4b71241290]
[thinkpad:11910] [17] /usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7f4b7124134a]
[thinkpad:11910] [18] ./src/ladds/ladds(+0x6cab5)[0x55ec2fae8ab5]
[thinkpad:11910] *** End of error message ***
[2022-08-12 10:57:49.318] [laddsLog] [Rank 0] [info] Min altitude is 6562.864112916569
[2022-08-12 10:57:49.318] [laddsLog] [Rank 0] [info] Max altitude is 8368.637769169021
[2022-08-12 10:57:49.318] [laddsLog] [Rank 0] [info] Number of particles: 16024
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_replace_aux
[thinkpad:11909] *** Process received signal ***
[thinkpad:11909] Signal: Aborted (6)
[thinkpad:11909] Signal code:  (-6)
[thinkpad:11909] [ 0] /usr/lib/libc.so.6(+0x3e8e0)[0x7f523d3228e0]
[thinkpad:11909] [ 1] /usr/lib/libc.so.6(+0x8e36c)[0x7f523d37236c]
[thinkpad:11909] [ 2] /usr/lib/libc.so.6(raise+0x18)[0x7f523d322838]
[thinkpad:11909] [ 3] /usr/lib/libc.so.6(abort+0xcf)[0x7f523d30c535]
[thinkpad:11909] [ 4] /usr/lib/libstdc++.so.6(+0x99833)[0x7f523d690833]
[thinkpad:11909] [ 5] /usr/lib/libstdc++.so.6(+0xa5bfc)[0x7f523d69cbfc]
[thinkpad:11909] [ 6] /usr/lib/libstdc++.so.6(+0xa5c69)[0x7f523d69cc69]
[thinkpad:11909] [ 7] /usr/lib/libstdc++.so.6(+0xa5ecd)[0x7f523d69cecd]
[thinkpad:11909] [ 8] /usr/lib/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x7f523d6934ce]
[thinkpad:11909] [ 9] /usr/lib/libstdc++.so.6(+0x14bf6c)[0x7f523d742f6c]
[thinkpad:11909] [10] ./src/ladds/ladds(+0xb780d)[0x55bf8534f80d]
[thinkpad:11909] [11] ./src/ladds/ladds(+0x142519)[0x55bf853da519]
[thinkpad:11909] [12] ./src/ladds/ladds(+0x140094)[0x55bf853d8094]
[thinkpad:11909] [13] ./src/ladds/ladds(+0xc0549)[0x55bf85358549]
[thinkpad:11909] [14] ./src/ladds/ladds(+0x76e01)[0x55bf8530ee01]
[thinkpad:11909] [15] ./src/ladds/ladds(+0x13d598)[0x55bf853d5598]
[thinkpad:11909] [16] /usr/lib/libc.so.6(+0x29290)[0x7f523d30d290]
[thinkpad:11909] [17] /usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7f523d30d34a]
[thinkpad:11909] [18] ./src/ladds/ladds(+0x6cab5)[0x55bf85304ab5]
[thinkpad:11909] *** End of error message ***
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_replace_aux
[thinkpad:11911] *** Process received signal ***
[thinkpad:11911] Signal: Aborted (6)
[thinkpad:11911] Signal code:  (-6)
[thinkpad:11911] [ 0] /usr/lib/libc.so.6(+0x3e8e0)[0x7fbbb386d8e0]
[thinkpad:11911] [ 1] /usr/lib/libc.so.6(+0x8e36c)[0x7fbbb38bd36c]
[thinkpad:11911] [ 2] /usr/lib/libc.so.6(raise+0x18)[0x7fbbb386d838]
[thinkpad:11911] [ 3] /usr/lib/libc.so.6(abort+0xcf)[0x7fbbb3857535]
[thinkpad:11911] [ 4] /usr/lib/libstdc++.so.6(+0x99833)[0x7fbbb3bdb833]
[thinkpad:11911] [ 5] /usr/lib/libstdc++.so.6(+0xa5bfc)[0x7fbbb3be7bfc]
[thinkpad:11911] [ 6] /usr/lib/libstdc++.so.6(+0xa5c69)[0x7fbbb3be7c69]
[thinkpad:11911] [ 7] /usr/lib/libstdc++.so.6(+0xa5ecd)[0x7fbbb3be7ecd]
[thinkpad:11911] [ 8] /usr/lib/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x7fbbb3bde4ce]
[thinkpad:11911] [ 9] /usr/lib/libstdc++.so.6(+0x14bf6c)[0x7fbbb3c8df6c]
[thinkpad:11911] [10] ./src/ladds/ladds(+0xb780d)[0x556bca2ba80d]
[thinkpad:11911] [11] ./src/ladds/ladds(+0x142519)[0x556bca345519]
[thinkpad:11911] [12] ./src/ladds/ladds(+0x140094)[0x556bca343094]
[thinkpad:11911] [13] ./src/ladds/ladds(+0xc0549)[0x556bca2c3549]
[thinkpad:11911] [14] ./src/ladds/ladds(+0x76e01)[0x556bca279e01]
[thinkpad:11911] [15] ./src/ladds/ladds(+0x13d598)[0x556bca340598]
[thinkpad:11911] [16] /usr/lib/libc.so.6(+0x29290)[0x7fbbb3858290]
[thinkpad:11911] [17] /usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7fbbb385834a]
[thinkpad:11911] [18] ./src/ladds/ladds(+0x6cab5)[0x556bca26fab5]
[thinkpad:11911] *** End of error message ***
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_replace_aux
[thinkpad:11908] *** Process received signal ***
[thinkpad:11908] Signal: Aborted (6)
[thinkpad:11908] Signal code:  (-6)
[thinkpad:11908] [ 0] /usr/lib/libc.so.6(+0x3e8e0)[0x7fd710e9f8e0]
[thinkpad:11908] [ 1] /usr/lib/libc.so.6(+0x8e36c)[0x7fd710eef36c]
[thinkpad:11908] [ 2] /usr/lib/libc.so.6(raise+0x18)[0x7fd710e9f838]
[thinkpad:11908] [ 3] /usr/lib/libc.so.6(abort+0xcf)[0x7fd710e89535]
[thinkpad:11908] [ 4] /usr/lib/libstdc++.so.6(+0x99833)[0x7fd71120d833]
[thinkpad:11908] [ 5] /usr/lib/libstdc++.so.6(+0xa5bfc)[0x7fd711219bfc]
[thinkpad:11908] [ 6] /usr/lib/libstdc++.so.6(+0xa5c69)[0x7fd711219c69]
[thinkpad:11908] [ 7] /usr/lib/libstdc++.so.6(+0xa5ecd)[0x7fd711219ecd]
[thinkpad:11908] [ 8] /usr/lib/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x44)[0x7fd7112104ce]
[thinkpad:11908] [ 9] /usr/lib/libstdc++.so.6(+0x14bf6c)[0x7fd7112bff6c]
[thinkpad:11908] [10] ./src/ladds/ladds(+0xb780d)[0x56553458280d]
[thinkpad:11908] [11] ./src/ladds/ladds(+0x142519)[0x56553460d519]
[thinkpad:11908] [12] ./src/ladds/ladds(+0x140094)[0x56553460b094]
[thinkpad:11908] [13] ./src/ladds/ladds(+0xc0549)[0x56553458b549]
[thinkpad:11908] [14] ./src/ladds/ladds(+0x76e01)[0x565534541e01]
[thinkpad:11908] [15] ./src/ladds/ladds(+0x13d598)[0x565534608598]
[thinkpad:11908] [16] /usr/lib/libc.so.6(+0x29290)[0x7fd710e8a290]
[thinkpad:11908] [17] /usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7fd710e8a34a]
[thinkpad:11908] [18] ./src/ladds/ladds(+0x6cab5)[0x565534537ab5]
[thinkpad:11908] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node thinkpad exited on signal 6 (Aborted).

@gomezzz
Collaborator

gomezzz commented Aug 12, 2022

Also, in the CI we are getting the same error I mentioned once before, which seems to occur occasionally, depending on the number of fragments generated in the breakup 🤔

 Note: Google Test filter = BreakupWrapperTest.testSimulationLoop
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from BreakupWrapperTest
[ RUN      ] BreakupWrapperTest.testSimulationLoop
LLVMSymbolizer: error reading file: No such file or directory
Warning: 11 08:53:07.499] [warning] The simulation reduced the number of fragments because the mass budget was exceeded. In other words: The random behaviour has produced heavier fragments
Warning: 11 08:53:07.500] [warning] The fragment count was reduced from 442 to 259 fragments.
==================
WARNING: ThreadSanitizer: data race (pid=9731)

@gomezzz
Collaborator

gomezzz commented Aug 15, 2022

@FG-TUM Please add a warning that constellations are broken :)

@gomezzz gomezzz mentioned this pull request Aug 16, 2022
1 task
@gomezzz gomezzz merged commit 59c7ad3 into main Aug 24, 2022
@gomezzz gomezzz deleted the mpiMadness branch August 24, 2022 07:55
Development

Successfully merging this pull request may close these issues.

MPI Support
2 participants