Use MPI groups to parallelize data assimilation #252
Conversation
source/ensemble_management.hh
Outdated
* normal distribution of average @p mean and standard deviation @p stddev.
* The first @p n_rejected_draws are rejected.
It took me a minute to realize the rationale for throwing away a predetermined number of draws. The idea is to get consistent fills of the vectors regardless of the number of global ranks, right? That's niche enough that a comment would help. Originally I was trying to sort out if throwing away initial draws gets higher quality pseudorandom numbers from the PRNG.
That's right, the idea is that it shouldn't matter if you use one MPI rank or ten, the ensemble members should use the same random numbers. If we don't reject the first few draws, all the processors will use the same "random" numbers since they have the same seed.
I'll add a comment.
Oh right -- even worse than I was thinking. Thanks for adding the comment.
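The scheme discussed above can be sketched as follows. This is a hypothetical standalone illustration (function name, seed, and signature are assumptions, not the actual `ensemble_management.hh` code): every rank seeds an identical engine and rejects the draws that belong to members owned by lower ranks, so member i receives the same numbers no matter how the members are partitioned across ranks.

```cpp
#include <random>
#include <vector>

// Hypothetical sketch: fill this rank's ensemble members with normal draws.
// Every rank uses the same seed; the draws belonging to members owned by
// lower ranks are generated and rejected, so a given member sees identical
// numbers whether the run uses one MPI rank or ten.
std::vector<std::vector<double>>
fill_local_members(unsigned const first_local_member,
                   unsigned const n_local_members,
                   unsigned const draws_per_member,
                   double const mean, double const stddev)
{
  std::mt19937 engine(12345); // identical seed on every rank (assumed value)
  std::normal_distribution<double> dist(mean, stddev);

  // Reject the draws already "used up" by members owned by lower ranks.
  unsigned const n_rejected_draws = first_local_member * draws_per_member;
  for (unsigned i = 0; i < n_rejected_draws; ++i)
    dist(engine);

  std::vector<std::vector<double>> members(n_local_members);
  for (auto &member : members)
  {
    member.resize(draws_per_member);
    for (auto &value : member)
      value = dist(engine);
  }
  return members;
}
```

Running this with one "rank" owning members 0 through 3 produces, member by member, the same vectors as two "ranks" owning members 0 to 1 and 2 to 3 respectively.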
// We need to gather the augmented_state_ensemble from the other processors.
// BlockVector is a complex structure with its own communicator and so we
// cannot simply use dealii's gather to perform the communication. Instead, we
// extract the locally owned data and gather it to processor zero. We do this
// in a two-step process. First, we move the data to local rank zero. Second,
// we move the data to global rank zero. The first step allows us to simply
// move complete vectors to global rank zero. Otherwise, the vector data would
// be divided into multiple chunks when gathered on global rank zero and we
// would have to reconstruct the vectors. Finally, we can build a new
// BlockVector using the local communicator.
Naively this seems like something that could be in its own function (either still as a member of DataAssimilator or not). What is your reasoning for putting it directly in this function? That we won't be keeping it forever since eventually we'd like to do distributed DA? That this is the only place we expect to use/need this functionality? Something else?
I'm not suggesting that it has to be moved, I just want to better understand the design.
My thinking for putting everything in the function was that during the gather step, I will need some data to know which processor owns which elements, and I can reuse this data during the scatter phase. I wrote this code a while back so I don't know if that's still true. I am fine with moving this code into two functions if you think it improves the code.
If you can move that code out pretty cleanly and the recalculation isn't going to be really expensive, then I think moving it out would improve the code. If one of those isn't true, a comment explaining it would be ok.
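A serial sketch of the two-step gather described in the quoted comment may help. The function name and the chunk layout are hypothetical, and the real MPI_Gatherv calls on the local and global communicators are replaced by plain copies so the data flow is visible:

```cpp
#include <vector>

// Serial model of the two-step gather (illustration only, no actual MPI).
// chunks[c][r] is the locally owned part of the vector held by rank r of
// local communicator c.
// Step 1: within each local communicator, concatenate the chunks onto the
// local rank zero, reconstructing one complete vector per communicator.
// Step 2: move the complete vectors from each local rank zero to global
// rank zero. Because step 1 already rebuilt full vectors, step 2 never has
// to stitch together interleaved chunks.
std::vector<double>
two_step_gather(std::vector<std::vector<std::vector<double>>> const &chunks)
{
  std::vector<double> on_global_rank_zero;
  for (auto const &local_comm : chunks)
  {
    // Step 1: gather to local rank zero -> one complete vector.
    std::vector<double> on_local_rank_zero;
    for (auto const &chunk : local_comm)
      on_local_rank_zero.insert(on_local_rank_zero.end(),
                                chunk.begin(), chunk.end());
    // Step 2: gather the complete vectors to global rank zero.
    on_global_rank_zero.insert(on_global_rank_zero.end(),
                               on_local_rank_zero.begin(),
                               on_local_rank_zero.end());
  }
  return on_global_rank_zero;
}
```

The point of the intermediate step is visible in the structure: without it, global rank zero would receive the chunks out of vector order and would have to reassemble them itself.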
With this PR, we can finally run data assimilation in parallel. We can use up to as many processors as there are ensemble members. This is done by splitting the global MPI communicator (we use MPI_COMM_WORLD) into multiple local communicators and then using these local communicators where needed. A lot of the PR is just about choosing the right communicator. The data assimilation itself is still done in serial on rank 0. This means that all the vectors need to be moved to rank 0 and back out of it after they have been updated. There are two difficulties with this operation:
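The communicator split mentioned in the PR description could look like the following sketch. The color computation is an assumption (a block distribution of ranks over local communicators, not necessarily what this PR implements); the color it returns would be passed to `MPI_Comm_split(MPI_COMM_WORLD, color, global_rank, &local_comm)`.

```cpp
// Hypothetical helper: compute the MPI_Comm_split color that groups the
// n_global_ranks ranks of MPI_COMM_WORLD into n_local_comms local
// communicators of (nearly) equal size. Consecutive global ranks share a
// local communicator (block distribution; an assumption for illustration).
unsigned split_color(unsigned const global_rank,
                     unsigned const n_global_ranks,
                     unsigned const n_local_comms)
{
  return global_rank * n_local_comms / n_global_ranks;
}
```

For example, with four global ranks and two local communicators, ranks 0 and 1 would get color 0 and ranks 2 and 3 would get color 1.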