New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable data sharing across processes #4086
Conversation
No. My vote would be to wait with merging this until after the 2.3 release as it is not tested at this point. |
Cherry pick the cleanups in #4086
I agree. This is pretty useful functionality, but I am afraid of subtle bugs. Also if we merge this when requiring 9.3 we could avoid the ifdefs. |
OK. Do any of you have a setup running regularly where this could be tested? I'll think of a test in the next few days, but I can't test on more than one node (but multiple processes) since I don't have access to any such systems on a regular basis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good, I only have two smaller comments.
About the test: Maybe @alarshi has a setup that uses an ascii data file that is somewhat large? Otherwise we have lots of ascii_data tests, but all of them use very small tables.
include/aspect/structured_data.h
Outdated
* The `coordinate_values` and `data_table` arguments are rvalue references, | ||
* and the function will move the data so provided into another storage | ||
* location. In other words, after the call, the variables passed as the | ||
* last two arguments may be empty or otherwise altered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are not the last arguments any longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gassmoeller: Yes, I can test it using the global tomography Ascii data, we input in our models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to talk to @bangerth first, I think the purpose is not only to find out if the code works, but also if it actually only stores a single copy per node (instead of a single copy per process).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be trivial to see if we actually need less memory (not sure how this shows up in the top, maybe under "shared"). But running it with 2 nodes to see if it works (including the check that the values are set on each rank!), would be good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, we will need to check with Intelmpi, openmpi and mpich.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already established that it stores only one copy with the underlying deal.II patches. I would love for someone to just try a larger model without and with this patch just to make sure that we get the same results. @alarshi: If that is something that would be easy for you to try, that would be fantastic!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bangerth: Absolutely! So, is the goal to run the test model using 2 nodes and then compare the difference with and without this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that would be totally sufficient to convince ourselves that it works. Thanks for giving it a try!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bangerth: I am getting segmentation fault when running my models with the branch. The error message is:
output_screen.txt
I checked that the model is running with the current master.
The input prm file and the ascii data I used are here: https://www.dropbox.com/sh/kxr9p2cw1j9ysjo/AAAvCZJ8Jhull2zLpUES5HZEa?dl=0
include/aspect/structured_data.h
Outdated
const MPI_Comm &mpi_communicator, | ||
const unsigned int root_process); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you provide defaults for these parameters? That way we stay backward compatible with user plugins that use this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, what should the default values be? I could give MPI_COMM_SELF
to the former and the special value numbers::invalid_unsigned_int
to the latter. If a user just calls the old version with both arguments defaulted, that works. If a user calls the new version with both arguments given, that also works. This scheme also seems to work if someone uses a provided communicator and defaults the last argument -- in that case, the communicator is simply ignored, which is also what is documented.
So I think I just talked myself through doing just that! :-)
@alarshi Thanks for providing me with a testcase! I've had to update a couple more places where we store duplicate information and where all processes needed to be given a copy of the data. This now works on my workstation. Would you be willing to run the same test case you gave me again on a larger machine? |
@bangerth: Absolutely! I just submitted the job for the model on HPC, will keep you updated on the results :) |
@bangerth: The test ran on the HPC, but I'm unable to render the model results in ParaView. It gives me an error as "byte index ... not well formed". I am running the model again to see if it is reproducible. |
That sounds unrelated. How do the other statistics ASPECT outputs compare with/without this patch? Do you get the same number of iterations, the same average velocity, etc? |
@bangerth: The number of iterations and the average velocity are the same between the two branches. Here are the two log files: |
Excellent, that's good news -- thanks very much for checking, @alarshi ! What does everyone else think? |
Looks good to me. Can you rebase and add a changelog entry? |
d6d8933
to
2a2fbd1
Compare
22716ef
to
339faac
Compare
Given that dealii/dealii#12606 has been merged, I've updated this PR to require deal.II 10.0 (i.e., the current dev version). This should now work, and I've examined this by running one of the tests that use ASCIIData on two processes with and without the patch, checking that we get the same output. |
b528f82
to
a26e366
Compare
OK, this was frustrating because I had introduced a bug in deal.II three weeks ago that required a patch, and there was one other bug that was exposed by the unit tests (but in a code path that is, I believe, not actually used in ASPECT itself). In any case, the deal.II bug has a patch in dealii/dealii#13361, and with this patch applied, the version of this patch works locally on my hard drive. It will still fail the CI check with the deal.II dev version until that container uses a patched deal.II. But I'm reasonably sure that this is now correct and does what is expected. |
Can you fix the conflicts, please? It should be good to go now. |
This is likely the time to finish this PR :-) |
It's going to take a bit to get that rebased :-( |
Well, that actually turned out to be quite the pain. In the end, I first reverted two of @jdannberg 's patches, then cherry-picked all of my patches, and then re-applied the two patches. If you'd like to follow along (or at the very least to document this), here's what I did:
I am pretty sure that I did this right, but the tester will show for sure. /rebuild |
That proves the point that even small indentation changes can be disruptive... |
The indentation change was the least of my concerns -- it was only one line in this file that was affected. @jdannberg's patches were a real hassle, though.. Anyway, it tests ok now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to go to me. Let's get this merged when the testers are done.
* location. In other words, after the call, the variables passed as the | ||
* second and third arguments may be empty or otherwise altered. | ||
* | ||
#if DEAL_II_VERSION_GTE(9,4,0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you put this ifdef inside the comment? I don't think this will work and it makes sense to have both parts in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doxygen runs (some variation of) the preprocessor before ingesting the sources. That's how our #ifndef DOXYGEN
works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But macros don't work inside a comment! See https://aspect.geodynamics.org/doc/doxygen/classaspect_1_1Utilities_1_1StructuredDataLookup.html#a7a5625b138bc39bf4e53e9be4d9b3314
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, interesting. On it -- see #4789.
Let's sort out the documentation later in order to avoid further conflicts with the other StructuredData PRs. |
This builds on #4083, so the first patch is duplicated from there (but because it's identical, we can still merge this assuming #4083 were merged).
I believe this should work, but I don't know for sure and it's late and I'm leaving town tomorrow afternoon until Monday. Let us see what the testers say -- they will at least tell us whether it works with deal.II 9.2. Do any of the testers run 9.3?
The real test would be if (i) we have tests that use ASCII data tables in parallel, and (ii) we check with deal.II 9.3. I don't know whether we have any that satisfy (i). Anyone know for sure?
/rebuild