[WIP] Add LinearAlgebra::SharedMPI::Vector #10872

Closed
wants to merge 62 commits

Conversation


@peterrum peterrum commented Sep 1, 2020

LinearAlgebra::SharedMPI::Vector

This PR adds a new vector class, LinearAlgebra::SharedMPI::Vector, built around MPI 3.0 shared-memory features.

MPI 3.0 provides a function to allocate memory for an array:

MPI_Win_allocate_shared(size, sizeof(Number), info, comm_shared, &data_this, win);

and a function to query a pointer to the beginning of the array owned by any given process on the same shared-memory domain:

MPI_Win_shared_query(*win, i, &ssize, &disp_unit, &data.others[i]);

The command to create a communicator comm_shared is:

MPI_Comm_split_type(
        comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &comm_shared);

These functions allow us to access all vector entries located on the same compute node (the same shared-memory domain) even in a purely MPI-parallelized program.
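Putting these pieces together, a minimal, self-contained sketch of the allocation pattern might look as follows (the array size of 100 entries and the absence of error handling are purely illustrative):

```cpp
#include <mpi.h>

#include <vector>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  // Split MPI_COMM_WORLD into shared-memory (compute-node) sub-communicators.
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm comm_shared;
  MPI_Comm_split_type(
    MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &comm_shared);

  int size_shared;
  MPI_Comm_size(comm_shared, &size_shared);

  // Allocate the locally owned part of the array inside a shared-memory window.
  const MPI_Aint local_size = 100;
  double *       data_this  = nullptr;
  MPI_Win        win;
  MPI_Win_allocate_shared(local_size * sizeof(double),
                          sizeof(double),
                          MPI_INFO_NULL,
                          comm_shared,
                          &data_this,
                          &win);

  // Query pointers to the arrays owned by all processes on this node.
  std::vector<double *> others(size_shared);
  for (int i = 0; i < size_shared; ++i)
    {
      MPI_Aint ssize;
      int      disp_unit;
      MPI_Win_shared_query(win, i, &ssize, &disp_unit, &others[i]);
    }

  // After synchronization, every process can read others[i][j] directly.
  MPI_Barrier(comm_shared);

  MPI_Win_free(&win);
  MPI_Comm_free(&comm_shared);
  MPI_Finalize();
}
```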

For more implementation details, see Section 4 and Appendix A of the pre-print of the hyper.deal release paper (https://arxiv.org/abs/2002.08110).

LinearAlgebra::SharedMPI::PartitionerBase

A second feature of the new vector class is that users can pass their own partitioner implementation. The first implementation, LinearAlgebra::SharedMPI::Partitioner, is just like Utilities::MPI::Partitioner (built around IndexSet) but internally splits export_to_ghost_array and import_from_ghost_array into two steps: one for remote data (which requires MPI_Send/MPI_Recv) and one for shared data (which is copied via memcpy if buffering is requested).

The interface should look familiar: it resembles that of Utilities::MPI::Partitioner, with the difference that we are working with vectors of pointers (which are returned by MPI_Win_shared_query).
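To illustrate the idea of the two-step exchange, here is a sketch with hypothetical names (GhostRange, update_ghost_values) that are not the interface added by this PR:

```cpp
#include <mpi.h>

#include <cstring>
#include <vector>

// Hypothetical description of one contiguous ghost range.
struct GhostRange
{
  int         owner_rank;   // rank of the owning process
  std::size_t remote_start; // first entry to read in the owner's array
  std::size_t local_start;  // where to place the data in the local ghost region
  int         n_entries;    // number of contiguous entries
};

void
update_ghost_values(double *                       data_this,   // owned + ghost storage
                    const std::vector<double *> &  data_others, // from MPI_Win_shared_query
                    const std::vector<GhostRange> &remote_ranges,
                    const std::vector<GhostRange> &shared_ranges,
                    MPI_Comm                       comm)
{
  // Step 1: ghost entries owned on other compute nodes need real MPI messages.
  std::vector<MPI_Request> requests(remote_ranges.size());
  for (std::size_t i = 0; i < remote_ranges.size(); ++i)
    MPI_Irecv(data_this + remote_ranges[i].local_start,
              remote_ranges[i].n_entries,
              MPI_DOUBLE,
              remote_ranges[i].owner_rank,
              /*tag=*/0,
              comm,
              &requests[i]);
  // (The matching sends on the owning processes are omitted in this sketch.)

  // Step 2: ghost entries owned on the same node are copied directly out of
  // the owner's shared-memory window -- no MPI message needed. In practice,
  // the processes on the node have to be synchronized before reading.
  for (const GhostRange &r : shared_ranges)
    std::memcpy(data_this + r.local_start,
                data_others[r.owner_rank] + r.remote_start,
                r.n_entries * sizeof(double));

  MPI_Waitall(static_cast<int>(requests.size()),
              requests.data(),
              MPI_STATUSES_IGNORE);
}
```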

In hyper.deal, we are using our own partitioner implementation, which is specially tailored for DG.

Motivation

In hyper.deal, we reached a speedup of 25-30% on a single compute node (with ECL) by not using MPI_Send/MPI_Recv, and in CEED BP1 a speedup of up to 10% for the operator evaluation. Our hope is to reach better results with DG.

Next steps

In the next step, we will extend MatrixFree's support for the new vector class, in particular so that we can use it for DG. Furthermore, we would like to investigate partitioner strategies within MatrixFree, which requires some internal changes - so let's keep that out of this PR!

Note: I still need to clean up the code and add some documentation. Nevertheless, preliminary feedback would be much appreciated! Maybe someone has suggestions on how to better integrate the new classes into the existing deal.II concepts.

Related to hyperdeal/hyperdeal#18 and kronbichler/ceed_benchmarks_dealii#7.


@kronbichler kronbichler left a comment


I have collected some random comments. I agree with the basic concepts because I have been discussing them with @peterrum before, but it would be good to have some additional comments.

include/deal.II/lac/la_sm_partitioner.h (outdated; resolved)
Comment on lines +29 to +34
namespace LinearAlgebra
{
namespace SharedMPI
{
/**
* Partitioner base class to be used in context of
Member


The namespaces are a bit of a mess; the generic LA::dist::Vector has its partitioner in base; maybe we should move that here or this one there?

Member Author


What do you suggest? To move all the partitioners here?

Member


I guess we can leave things here for the moment until someone feels a need to clean up things. I really like having the partitioner file named similarly and located in the same place. For the basic partitioner, that ship has sailed (well, we can always deprecate things, but I don't think it is worth it).

include/deal.II/lac/la_sm_partitioner.h (outdated; resolved)
include/deal.II/lac/la_sm_partitioner.h (outdated; resolved)
include/deal.II/lac/la_sm_partitioner.h (outdated; resolved)
include/deal.II/lac/la_sm_vector.h (outdated; resolved)
Comment on lines +637 to +656
/**
* Check whether the given partitioner is compatible with the
* partitioner used for this vector.
*
* @note Not implemented yet.
*/
bool
partitioners_are_compatible(
const Utilities::MPI::Partitioner &part) const;

/**
* Check whether the given partitioner is compatible with the
* partitioner used for this vector.
*
* @note Not implemented yet.
*/
bool
partitioners_are_globally_compatible(
const Utilities::MPI::Partitioner &part) const;
Member


Is this needed at all?

Member Author


Yup. MatrixFree will complain otherwise (if I remember correctly).

Member


The class does some checks depending on which path the type traits of the vector send the evaluator down. I do not feel strongly here, so we can leave things in this state for the moment until we have all members filled in.

Utilities::MPI::this_mpi_process(comm_shared);

MPI_Win *win = new MPI_Win;
Number * data_this = (Number *)malloc(0);
Member


This is also a question mark because I saw memory leaks on some MPI implementations.

Member Author


I have fixed this. According to the source code of Open MPI

    *((void**) baseptr) = base;

see also https://github.com/open-mpi/ompi/blob/6c46da32454553a52c6b0c30cae8d0075c43cd94/ompi/win/win.c#L323, the function MPI_Win_allocate_shared expects a void**, i.e., in our case &data_this...
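A sketch of the corresponding call, reusing Number, comm_shared, and win from the snippet above and assuming a hypothetical local_size for the number of owned entries:

```cpp
// Sketch: the base-pointer argument is declared as void*, but the function
// writes the address of the allocated segment into it, so we pass the address
// of our pointer variable (&data_this) instead of a pre-allocated buffer.
Number * data_this = nullptr;
MPI_Win *win       = new MPI_Win;
MPI_Win_allocate_shared(local_size * sizeof(Number),
                        sizeof(Number),
                        MPI_INFO_NULL,
                        comm_shared,
                        &data_this, // interpreted as void** internally
                        win);
```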

include/deal.II/lac/la_sm_vector.templates.h (outdated; resolved)
include/deal.II/matrix_free/matrix_free.h (outdated; resolved)

peterrum commented Oct 3, 2020

@kronbichler I have addressed many of your comments. There are some where I am not sure how to proceed.


@kronbichler kronbichler left a comment


I think this is a good step forward. Of course, we still need to fill in the interfaces of the vector before we can use it more generally, but I would be fine with leaving that to another PR (if we manage to finalize things during the next month). It would be great if someone else could give it a look as well.

@kronbichler

/rebuild


tjhei commented Oct 7, 2020

I have some fundamental questions (without having looked at the code very closely):

  1. To me this seems like an optimization of the existing parallel vector (to grant "local" access to entries on the same node). Is there anything else that is new/different?
  2. Why would one not want to use this over the existing parallel vector?
  3. If there is no good reason to do so in 2), why is this not an extension of the existing vector class? It is not particularly user-friendly to introduce yet another vector type. Maybe this could be a setting/policy/compile-time setting?
  4. Does this mean that we need a lot of new template instantiations for the new vector type? That would worry me.
  5. I am not sure I like the name "SharedMPI" (not that I have a better idea, but it reminds me of shared::Tria).


peterrum commented Oct 8, 2020

To me this seems like an optimization of the existing parallel vector (to grant "local" access to entries on the same node). Is there anything else that is new/different?

I had the option to introduce 1) a new MemorySpace type, 2) a new mode/setting of L:d:V, or 3) a new vector class.

The first two options would mean that L:d:V would by default carry information regarding shared memory: a second communicator, functions which would only work under certain circumstances, ...

In my opinion, the cleanest approach is to introduce a new vector class, since it is different at its core. For instance, L:d:V is created with IndexSets (which are transformed into a Partitioner), and as such the vector has a global view of the data. The new vector, however, is much simpler: it is only a container of data of a specific size (and as such the data can only be accessed with local indices). The interpretation of the data is completely handled by a new set of (potentially user-provided) partitioners. For instance, this PR introduces a partitioner which is built around IndexSets, but in hyper.deal we have a partitioner tailored for DG (once this PR is accepted and merged, we will create a follow-up PR to move it here).
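Purely as a conceptual illustration (all names below are hypothetical and not part of this PR), the separation could be pictured like this:

```cpp
// Hypothetical sketch -- not the classes added by this PR.
#include <cstddef>
#include <vector>

// The vector itself is only storage of a given local size; entries are
// addressed through local indices, with no notion of global indices here.
struct LocalContainer
{
  std::vector<double> values; // locally owned (+ ghost) entries

  double &local_element(const std::size_t i) { return values[i]; }
};

// Everything concerning the interpretation of the entries -- which global
// index a local entry corresponds to, which entries are ghosts, and how data
// is exchanged (MPI messages vs. memcpy on the node) -- lives in a separate,
// possibly user-provided, partitioner.
struct PartitionerInterface
{
  virtual ~PartitionerInterface() = default;

  virtual std::size_t locally_stored_size() const                     = 0;
  virtual void        export_to_ghosted_array(LocalContainer &vec) const   = 0;
  virtual void        import_from_ghosted_array(LocalContainer &vec) const = 0;
};
```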

Why would one not want to use this over the existing parallel vector?

One would need to work with a second communicator and potentially deal with race conditions in user code.

If there is no good reason to do so in 2), why is this not an extension to the existing vector class? This is not particularly user-friendly to introduce another new vector type. Maybe this can be a setting/policy/compile time setting?

  • Setting: this is more than a setting: the user needs to provide a second communicator and a partitioner (and, depending on the setting, either one of the two partitioners (old/new) and/or different methods would be used).
  • The policy is the partitioner, but I don't think it is a good idea to give the old Utilities::MPI::Partitioner the same interface as the new one (which is specialized for the case that you have pointers to the data of other processes).
  • A compile-time setting has the same problems as the first point.

Does this mean that we need a lot of new template instantiations for the new vector type? That would worry me.

The major benefit of this class will be in performance-critical code paths, i.e., MatrixFree: we are working on native support for this vector class also for DG in MatrixFree, where we expect the largest benefit. So one could use MatrixFree to set up the vectors, or create an L:d:V and/or copy over the content once from an L:d:V, without the need to instantiate all the functions. The first approach works very well (in hyper.deal), and we should probably create a set of new functions in MatrixFreeTools (since probably all users of MatrixFree have implemented their own routines for computing norms, interpolating/projecting functions onto a vector, and computing the right-hand-side vector with MatrixFree).

I am not sure I like the name "SharedMPI" (not that I have a better idea, but it reminds me of shared::Tria).

That p:s:T has a misleading and unfortunate name should not be a reason to dismiss the name SharedMPI: p:s:T is not shared but simply replicated on all processes, which led to some confusion until I had a look at the implementation. The new vector is actually shared among processes via shared memory (provided by MPI).


@tjhei tjhei left a comment


It would be great if you could write documentation for the new vector type. That would have helped with my misunderstanding of what this class is trying to do.

Also, we only require MPI 2, so some of this needs to be guarded by ifdefs, right?
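For example, such a guard could look roughly like the following fragment (a sketch only, assuming comm and rank are in scope; MPI_VERSION is provided by mpi.h according to the MPI standard, and AssertThrow/ExcMessage are the usual deal.II error macros):

```cpp
#ifdef DEAL_II_WITH_MPI
#  if MPI_VERSION >= 3
  // MPI >= 3.0: the shared-memory window machinery is available.
  MPI_Comm comm_shared;
  MPI_Comm_split_type(
    comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &comm_shared);
#  else
  // Older MPI: refuse to construct the shared-memory vector.
  AssertThrow(false,
              ExcMessage(
                "LinearAlgebra::SharedMPI::Vector requires MPI >= 3.0."));
#  endif
#endif
```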


namespace LinearAlgebra
{
namespace SharedMPI
Member


namespace documentation?



template <typename Number, typename MemorySpace = MemorySpace::Host>
class Vector : public ::dealii::LinearAlgebra::VectorSpaceVector<Number>,
Member


uh, no documentation for the new class? Can you fix that please?

* Get pointers to the beginning of the values of the other
* processes of the same shared-memory domain.
*
* TODO: name of the function?
Member


?


tjhei commented Oct 8, 2020

One would need to work with a second communicator and potentially deal with race conditions in user code.

So maybe I misunderstood what this vector class can do. Are you saying that this is to be used only within a node (using a separate communicator) and there are no MPI communication capabilities? An overview of what this class does would be great to have as class documentation.

@peterrum peterrum changed the title Add LinearAlgebra::SharedMPI::Vector [WIP] Add LinearAlgebra::SharedMPI::Vector Oct 16, 2020
@peterrum peterrum added this to In progress in MPI-3 shared memory Oct 23, 2020

peterrum commented Nov 5, 2020

I am closing this PR since the shared-memory feature has been introduced into L:p:V in a sequence of PRs, beginning with PR #11074.

@peterrum peterrum closed this Nov 5, 2020