Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable data sharing across processes #4086

Merged
merged 12 commits into from May 22, 2022
Merged

Conversation

bangerth
Copy link
Contributor

This builds on #4083, so the first patch is duplicated from there (but because it's identical, we can still merge this assuming #4083 were merged).

I believe this should work, but I don't know for sure and it's late and I'm leaving town tomorrow afternoon until Monday. Let us see what the testers say -- they will at least tell us whether it works with deal.II 9.2. Do any of the testers run 9.3?

The real test would be if (i) we have tests that use ASCII data tables in parallel, and (ii) we check with deal.II 9.3. I don't know whether we have any that satisfy (i). Anyone know for sure?

/rebuild

@tjhei
Copy link
Member

tjhei commented Jun 23, 2021

Do any of the testers run 9.3?

No.

My vote would be to wait with merging this until after the 2.3 release as it is not tested at this point.

gassmoeller added a commit that referenced this pull request Jun 23, 2021
@gassmoeller
Copy link
Member

My vote would be to wait with merging this until after the 2.3 release as it is not tested at this point.

I agree. This is pretty useful functionality, but I am afraid of subtle bugs. Also if we merge this when requiring 9.3 we could avoid the ifdefs.

@bangerth
Copy link
Contributor Author

bangerth commented Jul 1, 2021

OK. Do any of you have a setup running regularly where this could be tested? I'll think of a test in the next few days, but I can't test on more than one node (but multiple processes) since I don't have access to any such systems on a regular basis.

Copy link
Member

@gassmoeller gassmoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, I only have two smaller comments.

About the test: Maybe @alarshi has a setup that uses an ascii data file that is somewhat large? Otherwise we have lots of ascii_data tests, but all of them use very small tables.

* The `coordinate_values` and `data_table` arguments are rvalue references,
* and the function will move the data so provided into another storage
* location. In other words, after the call, the variables passed as the
* last two arguments may be empty or otherwise altered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not the last arguments any longer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gassmoeller: Yes, I can test it using the global tomography Ascii data, we input in our models.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to talk to @bangerth first, I think the purpose is not only to find out if the code works, but also if it actually only stores a single copy per node (instead of a single copy per process).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be trivial to see if we actually need less memory (not sure how this shows up in the top, maybe under "shared"). But running it with 2 nodes to see if it works (including the check that the values are set on each rank!), would be good to know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, we will need to check with Intelmpi, openmpi and mpich.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already established that it stores only one copy with the underlying deal.II patches. I would love for someone to just try a larger model without and with this patch just to make sure that we get the same results. @alarshi: If that is something that would be easy for you to try, that would be fantastic!

Copy link
Contributor

@alarshi alarshi Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bangerth: Absolutely! So, is the goal to run the test model using 2 nodes and then compare the difference with and without this patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that would be totally sufficient to convince ourselves that it works. Thanks for giving it a try!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bangerth: I am getting segmentation fault when running my models with the branch. The error message is:
output_screen.txt
I checked that the model is running with the current master.
The input prm file and the ascii data I used are here: https://www.dropbox.com/sh/kxr9p2cw1j9ysjo/AAAvCZJ8Jhull2zLpUES5HZEa?dl=0

Comment on lines 123 to 124
const MPI_Comm &mpi_communicator,
const unsigned int root_process);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you provide defaults for these parameters? That way we stay backward compatible with user plugins that use this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, what should the default values be? I could give MPI_COMM_SELF to the former and the special value numbers::invalid_unsigned_int to the latter. If a user just calls the old version with both arguments defaulted, that works. If a user calls the new version with both arguments given, that also works. This scheme also seems to work if someone uses a provided communicator and defaults the last argument -- in that case, the communicator is simply ignored, which is also what is documented.

So I think I just talked myself through doing just that! :-)

@bangerth
Copy link
Contributor Author

@alarshi Thanks for providing me with a testcase! I've had to update a couple more places where we store duplicate information and where all processes needed to be given a copy of the data. This now works on my workstation. Would you be willing to run the same test case you gave me again on a larger machine?

@alarshi
Copy link
Contributor

alarshi commented Jul 15, 2021

@bangerth: Absolutely! I just submitted the job for the model on HPC, will keep you updated on the results :)

@alarshi
Copy link
Contributor

alarshi commented Jul 16, 2021

@bangerth: The test ran on the HPC, but I'm unable to render the model results in ParaView. It gives me an error as "byte index ... not well formed". I am running the model again to see if it is reproducible.

@bangerth
Copy link
Contributor Author

That sounds unrelated. How do the other statistics ASPECT outputs compare with/without this patch? Do you get the same number of iterations, the same average velocity, etc?

@alarshi
Copy link
Contributor

alarshi commented Jul 19, 2021

@bangerth: The number of iterations and the average velocity are the same between the two branches. Here are the two log files:
master-log.txt
bangerth-log.txt

@bangerth
Copy link
Contributor Author

Excellent, that's good news -- thanks very much for checking, @alarshi !

What does everyone else think?

@gassmoeller
Copy link
Member

Looks good to me. Can you rebase and add a changelog entry?

@bangerth
Copy link
Contributor Author

Given that dealii/dealii#12606 has been merged, I've updated this PR to require deal.II 10.0 (i.e., the current dev version). This should now work, and I've examined this by running one of the tests that use ASCIIData on two processes with and without the patch, checking that we get the same output.

@bangerth
Copy link
Contributor Author

OK, this was frustrating because I had introduced a bug in deal.II three weeks ago that required a patch, and there was one other bug that was exposed by the unit tests (but in a code path that is, I believe, not actually used in ASPECT itself). In any case, the deal.II bug has a patch in dealii/dealii#13361, and with this patch applied, the version of this patch works locally on my hard drive.

It will still fail the CI check with the deal.II dev version until that container uses a patched deal.II. But I'm reasonably sure that this is now correct and does what is expected.

@tjhei
Copy link
Member

tjhei commented Feb 19, 2022

Can you fix the conflicts, please? It should be good to go now.

@gassmoeller
Copy link
Member

This is likely the time to finish this PR :-)

@bangerth
Copy link
Contributor Author

It's going to take a bit to get that rebased :-(

@bangerth
Copy link
Contributor Author

Well, that actually turned out to be quite the pain. In the end, I first reverted two of @jdannberg 's patches, then cherry-picked all of my patches, and then re-applied the two patches. If you'd like to follow along (or at the very least to document this), here's what I did:

I am pretty sure that I did this right, but the tester will show for sure.

/rebuild

@tjhei
Copy link
Member

tjhei commented May 22, 2022

That proves the point that even small indentation changes can be disruptive...

@bangerth
Copy link
Contributor Author

The indentation change was the least of my concerns -- it was only one line in this file that was affected. @jdannberg's patches were a real hassle, though..

Anyway, it tests ok now.

Copy link
Member

@gassmoeller gassmoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to go to me. Let's get this merged when the testers are done.

* location. In other words, after the call, the variables passed as the
* second and third arguments may be empty or otherwise altered.
*
#if DEAL_II_VERSION_GTE(9,4,0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you put this ifdef inside the comment? I don't think this will work and it makes sense to have both parts in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doxygen runs (some variation of) the preprocessor before ingesting the sources. That's how our #ifndef DOXYGEN works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. On it -- see #4789.

@gassmoeller
Copy link
Member

Let's sort out the documentation later in order to avoid further conflicts with the other StructuredData PRs.

@gassmoeller gassmoeller merged commit d3a8224 into geodynamics:main May 22, 2022
@bangerth bangerth deleted the move-2 branch May 23, 2022 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants