Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parallel support to TpetraWrappers::Vector. #16290

Merged
merged 2 commits into from Jan 18, 2024

Conversation

kinnewig
Copy link
Contributor

@kinnewig kinnewig commented Nov 24, 2023

When I tried to use it in parallel, I discovered several bugs in the TpetraWrappers::Vector class. I bundled all fixes into this one pull request.

There are some changes I would like to highlight so that we can discuss them:

operator=

I have several issues with the original implementation:

  1. The existing implementation of the operator= does not work in parallel (see issue Regression tester regressed [baseline] #16286).
  2. It does not copy the complete content of the vector, which is not the behavior I would expect. (Here, I should highlight that the original implementation behaves similarly to the Epetra implementation. But still, this differs from what I would expect when I call the operator=.)
  3. There is a separation by cases, which isn't necessary. In one case, it calls the Tpetra::import() function without checking that the conditions are met to call that function. Moreover, import() requires communication between the ranks. So, I don't know if this is faster.

So, I propose making a deep copy of the right-hand side.

compress()/decompress()

So far, the compress() function has not been implemented. But for parallel usage, we need a way to communicate changes from one rank to another rank. But in contrast to the Epetra::Vector, the Tpetra::Vector can not write into off-processor entries and then communicate that later. The way to overcome that problem, I came up with is to introduce two maps:

  • owned_elements: containing a non-overlapping map of all elements of the vector. So, each rank can only access the values that belong to that rank.
  • relevant_elements: an overlapping map of all elements so each rank can access all elements that belong to that rank, as well as all necessary ghost elements.

We call a vector compressed if and only if the underlying map of the Tpetra::Vector is non-overlapping.
When we want to access a ghost element (that only lives in the relevant_elements map), we must first call decompress() (in case the vector is compressed).
And once we are done, we have to call compress() to communicate the changes between all ranks.

decompress() uses import() to distribute the internal TpetraVector from the owned_elements map to the relevant_elements map.

compress() uses export() to distribute the internal Tpetra::Vector from the relevant_elements map to the owned_elements map.

print()

I reworked the print() function to match the output from the Epetra-based TrilinosWrappers::MPI::Vector. This should make reusing tests from the Epetra-based vector for the Tpetra-based vector easier.
Since I made some changes to the print() function, I had to adapt the expected output from some already existing tests for the Tpetra-based vector.

Edit:
In the original draft of this pull request, I included an additional template parameter for Kokkos::Nodes.
I removed that template parameter from this pull request and will create later an individual pull request to add that additional template parameter.

Copy link
Member

@bangerth bangerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments, but it's a holiday long weekend here so I'm not going to work too hard today :-)

I think that the operator= thing might require more thought. But if you want to break the changes that add a Node argument to the class out into a separate patch, that is pretty uncontroversial and could be merged separately.

include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
@bangerth
Copy link
Member

Let me know when you want me to take another look at this patch!

@kinnewig
Copy link
Contributor Author

Sorry for the delay; I was very busy the last few weeks, and on top of that, I was on vacation in between.
But I just continued working on this pull request yesterday.

First, Thank you, @bangerth, for the feedback, especially on the copy assignment operator.

Based on your feedback, I took another look into that. I restored the original behavior of the copy assignment operator, but now it also works in parallel. But I added a new assert to address at least my main concern.

This makes the decompress() function redundant; therefore, I removed it.

Moreover, I adopted a few assert messages from the (Epetra-based) Trilinos Vector class.

So, I would be happy if someone could take another look into the reworked changes.

include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.h Outdated Show resolved Hide resolved
Comment on lines 879 to +876
/**
* IndexSet of the elements of the last imported vector.
*/
::dealii::IndexSet source_stored_elements;
dealii::IndexSet source_stored_elements;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pre-existing, but I'm not fond of this variable name -- perhaps that can be addressed in a separate PR if you were willing?

Conceptually, objects live in the present. Perhaps they have been initialized, but they typically do not retain how they have been initialized. If they need to cache certain information, as I think you are doing here, maybe a better name would be to use last_ as a prefix instead of source_. But I may also be missing what it is you are doing here.

Copy link
Contributor Author

@kinnewig kinnewig Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will open a separate PR for this later.

include/deal.II/lac/trilinos_tpetra_vector.templates.h Outdated Show resolved Hide resolved
Vector<Number>::compress(const VectorOperation::values operation,
const IndexSet &locally_owned)
{
if (!compressed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made calling compress() a crime on vectors with ghost elements in #16378 Perhaps you can add the same assertion here, and then you don't need the if (!compressed).

Copy link
Contributor Author

@kinnewig kinnewig Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we have a clash of notation here:

Originally, I called a vector in this class compressed if and only if the vector can only read and write its locally owned elements. Only a non-compressed vector can read and write elements from other ranks.

As discussed above, I changed compressed to has_ghost_elements (where has_ghost_elements has the opposite meaning). And now it only makes sense to call compress() on vectors where has_ghost_elements == true...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's the other way around. Vectors with has_ghost_values == true are read-only. It does not make sense to call compress() on them.

Copy link
Contributor Author

@kinnewig kinnewig Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a nutshell: It does not hold true that:

compressed = (!has_ghost_elements)

I compared the vector class against the Epetra-based vector class. Both compressed and has_ghost_elements are used there. They have similar meanings, but it still does not work to use compressed = (!has_ghost_elements). (If you are interested, I can explain which conflict occurs).

Edit:
Writing it down helped me to gain a better understanding of this problem.

I now understand when I have to say if a vector has ghost elements or not for everything to work.
I just had to adapt two points in trilinos_tpetra_vector.templates.h.

Sorry for the confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We looked into this together and found that the capabilities of the Epetra FEVector are no longer there.

So what we can no longer provide is a reinit(local_set,MPI_Comm,omit_zeroing_entries) as we need the nonlocal_vector to handle all incoming data as SumIntoGlobalValues will only work on locally owned global indices.

What we can do instead is actually use the vector_writable flag to construct a locally owned vector
as well as a nonlocal vector that stores all off-processor entries and communicates them during compress.

Sebastian is working on implementing this.
The question would now be what to do with the three argument reinit above.
We see three options

  • Just don't implement it (could make porting code harder to understand)
  • Implement it and throw an assertion (leads to completely undefined behaviour in release mode)
  • Implement it and throw an exception

I'm tending towards the last option. In the two 'implement' cases I would suggest the following text.

Due to differences between Tpetra and Epetra this capability is no longer supported.
To construct a locally owned vector use reinit(locally_owned,locally_relevant,mpi_communicator,/*writable_vector*/ true)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kronbichler, thank you for the detailed feedback. And the FEVector history is a very interesting side mark!

I just pushed my latest changes, where I added the nonlocal_vector. The TpetraWrappers::Vector<Number> current state works quite similarly to the old Epetra-based vector class.

But at this point, it would be pretty easy to make the new Tpetra-based vector class more similar to the LinearAlgebra::distributed::Vector and use vector_is_ghosted in the same way as it is used in LA::d::Vector. This is mostly a design decision, therefore one of the principal developers has to say how we want to continue at this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am obviously a bit biased and would feel comfortable with the LA::d::Vector setup, but I do also think other options could work. I think we should have a common understanding here, so additional opinions would be appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a long-standing disagreement between @kronbichler and me that we've had a friendly exchange about for probably the better part of a decade :-)

My perspective is that it is useful to have a system in place that minimizes mistakes. Creating a vector so that it knows whether it can or cannot be written into is one such way, and it enables us to have assertions that check these semantics; it also avoids users to have to remember when they need to update ghost values. @kronbichler's perspective -- equally with merit -- is that we can come up with a system that does not require users to keep around two kinds of vectors, and that is perhaps more performant.

I recognize that it is unfair to you to have to choose between the positions of two long-time developers of the library. My personal preference would be if we could get the semantics of the Tpetra wrappers as close as possible to those of the Epetra wrappers, principally because we would like to replace the latter by the former with the minimal amount of changes necessary -- ideally, our users will never notice that the underlying technology has changed. I recognize that @kronbichler has good reasons to argue for switching to his preferred approach as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right,

I just pushed one last change, removing the now-unused locally_(owned|relevant)_map.
Now everything is ready from my side to be merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also fine with having a minimal interface change between Epetra and Tpetra, and have the vector take either state (ghosted or non-ghosted) at the point it is constructed/initialized. I think the underlying mechanism with nonlocal_vector and manual import/export is uncontroversial, so the main mechanism is in place should we ever want to re-work the vector interfaces in a direction where a deep copy of the data would not be necessary to go from the read-only/ghosted state to the writable/non-ghosted state. I think there is consensus between @bangerth and me that starting to write into a ghosted vector leads to all kinds of problems because data can and will get inconsistent, and so one should help the user avoid this situation. The other things are not parts where I feel strongly.

include/deal.II/lac/trilinos_tpetra_vector.templates.h Outdated Show resolved Hide resolved
include/deal.II/lac/trilinos_tpetra_vector.templates.h Outdated Show resolved Hide resolved
Comment on lines 780 to 787
VectorType dummy = VectorType(locally_owned_map.getConst());

Teuchos::RCP<const ExportType> exporter =
Tpetra::createExport(locally_relevant_map.getConst(),
locally_owned_map.getConst());
dummy.doExport(*vector, *exporter, tpetra_operation);

*vector = dummy;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite an expensive operation for compress(). Why is this necessary to make things work, given that that other vector types do not need to create such a temporary vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessary to use a dummy vector here.

I learned this after I wrote that part of the class. I removed that dummy vector.

Again, thanks for pointing this out!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have tested this before I answered; it seems like we need that dummy vector here.

It is not necessary in the Epetra-based vector class since a second vector, i.e., the nonlocal_vector, is already defined.

@bangerth
Copy link
Member

bangerth commented Jan 9, 2024

Like for the other patch, once you're happy with it, squash into a self-contained set of commits and let me know!

@bangerth
Copy link
Member

FYI, the 'indent' CI failure is this one:

/home/runner/work/dealii/dealii/include/deal.II/lac/trilinos_tpetra_vector.h:700: warning: argument 'operator' of command @param is not found in the argument list of LinearAlgebra::TpetraWrappers::Vector< Number >::compress(const VectorOperation::values operation)
/home/runner/work/dealii/dealii/include/deal.II/lac/trilinos_tpetra_vector.h:700: warning: The following parameter of LinearAlgebra::TpetraWrappers::Vector::compress(const VectorOperation::values operation) is not documented:
parameter 'operation'

@kinnewig kinnewig force-pushed the tpetra_vector_parallel branch 2 times, most recently from 76dfcea to 93825f2 Compare January 12, 2024 14:08
Copy link
Member

@kronbichler kronbichler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not check all details, but looking through the code is a pleasure. I'm looking forward to have all functionality moved to Tpetra.

@kinnewig
Copy link
Contributor Author

Thank you for all the nice feedback!

From my side, the pull request can now be merged.

@kronbichler
Copy link
Member

All comments were addressed. Let me merge this now since we do not seem to get new comments; we can always make follow-up PRs with additional cleanups if we observe them. Thank you @kinnewig for the work!

@kronbichler kronbichler merged commit 08160c0 into dealii:master Jan 18, 2024
15 checks passed
@kinnewig kinnewig deleted the tpetra_vector_parallel branch February 18, 2024 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants