
Remove parallel::TriangulationBase::compute_n_locally_owned_active_cells_per_processor #9945

Conversation

peterrum
Member

In 5eb3561, we replaced the method n_locally_owned_active_cells_per_processor() by compute_n_locally_owned_active_cells_per_processor(), since we did not want to store the information in the NumberCache anymore (it is too expensive for large simulations). I think it would be an option to remove the new function, since the same effect can be achieved in user code with the following one-liner:

Utilities::MPI::all_gather(tr.get_communicator(), tr.n_locally_owned_active_cells());

I am not sure about backwards compatibility, but I guess the change in 5eb3561 was not backwards compatible either.
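
As a minimal sketch of the proposed replacement in user code (assuming a parallel::distributed::Triangulation<dim> object named tr; the variable names are only for illustration):

// deprecated:
//   const std::vector<unsigned int> n_cells_per_proc =
//     tr.compute_n_locally_owned_active_cells_per_processor();

// suggested one-liner instead:
const std::vector<unsigned int> n_cells_per_proc =
  Utilities::MPI::all_gather(tr.get_communicator(),
                             tr.n_locally_owned_active_cells());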

@kronbichler
Member

This touches on the bigger question of the same function in the DoFHandler:

std::vector<types::global_dof_index>
compute_n_locally_owned_dofs_per_processor() const;

This was introduced in #8298, so we have not released that part either, and we could replace it as well. The only nuisance is the other function
std::vector<IndexSet>
compute_locally_owned_dofs_per_processor() const;

where we get an IndexSet, and I am not sure whether all_gather works for that type. In case it does, we should remove this lengthy implementation
std::vector<IndexSet>
NumberCache::get_locally_owned_dofs_per_processor(
  const MPI_Comm mpi_communicator) const
{
  AssertDimension(locally_owned_dofs.size(), n_global_dofs);
  const unsigned int n_procs =
    Utilities::MPI::job_supports_mpi() ?
      Utilities::MPI::n_mpi_processes(mpi_communicator) :
      1;
  if (n_global_dofs == 0)
    return std::vector<IndexSet>();
  else if (locally_owned_dofs_per_processor.empty() == false)
    {
      AssertDimension(locally_owned_dofs_per_processor.size(), n_procs);
      return locally_owned_dofs_per_processor;
    }
  else
    {
      std::vector<IndexSet> locally_owned_dofs_per_processor(
        n_procs, locally_owned_dofs);

#ifdef DEAL_II_WITH_MPI
      if (n_procs > 1)
        {
          // this step is substantially more complicated because indices
          // might be distributed arbitrarily among the processors. Here we
          // have to serialize the IndexSet objects and ship them across the
          // network.
          std::vector<char> my_data;
          {
#  ifdef DEAL_II_WITH_ZLIB
            boost::iostreams::filtering_ostream out;
            out.push(boost::iostreams::gzip_compressor(
              boost::iostreams::gzip_params(
                boost::iostreams::gzip::best_speed)));
            out.push(boost::iostreams::back_inserter(my_data));

            boost::archive::binary_oarchive archive(out);
            archive << locally_owned_dofs;
            out.flush();
#  else
            std::ostringstream              out;
            boost::archive::binary_oarchive archive(out);
            archive << locally_owned_dofs;
            const std::string &s = out.str();
            my_data.reserve(s.size());
            my_data.assign(s.begin(), s.end());
#  endif
          }

          // determine maximum size of IndexSet
          const unsigned int max_size =
            Utilities::MPI::max(my_data.size(), mpi_communicator);

          // as the MPI_Allgather call will be reading max_size elements,
          // and as this may be past the end of my_data, we need to increase
          // the size of the local buffer. This is filled with zeros.
          my_data.resize(max_size);

          std::vector<char> buffer(max_size * n_procs);
          const int         ierr = MPI_Allgather(my_data.data(),
                                                 max_size,
                                                 MPI_BYTE,
                                                 buffer.data(),
                                                 max_size,
                                                 MPI_BYTE,
                                                 mpi_communicator);
          AssertThrowMPI(ierr);

          for (unsigned int i = 0; i < n_procs; ++i)
            if (i == Utilities::MPI::this_mpi_process(mpi_communicator))
              locally_owned_dofs_per_processor[i] = locally_owned_dofs;
            else
              {
                // copy the data previously received into a stringstream
                // object and then read the IndexSet from it
                std::string decompressed_buffer;

                // first decompress the buffer
                {
#  ifdef DEAL_II_WITH_ZLIB
                  boost::iostreams::filtering_ostream decompressing_stream;
                  decompressing_stream.push(
                    boost::iostreams::gzip_decompressor());
                  decompressing_stream.push(
                    boost::iostreams::back_inserter(decompressed_buffer));
                  decompressing_stream.write(&buffer[i * max_size], max_size);
#  else
                  decompressed_buffer.assign(&buffer[i * max_size], max_size);
#  endif
                }

                // then restore the object from the buffer
                std::istringstream              in(decompressed_buffer);
                boost::archive::binary_iarchive archive(in);

                archive >> locally_owned_dofs_per_processor[i];
              }
        }
#endif
      return locally_owned_dofs_per_processor;
    }
}

and just kick out the functions altogether.
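
If all_gather indeed handles IndexSet (as checked below), a rough sketch of how the body above could collapse, keeping the same NumberCache members, might look like this:

std::vector<IndexSet>
NumberCache::get_locally_owned_dofs_per_processor(
  const MPI_Comm mpi_communicator) const
{
  AssertDimension(locally_owned_dofs.size(), n_global_dofs);
  if (n_global_dofs == 0)
    return std::vector<IndexSet>();

  // Boost serialization inside Utilities::MPI::all_gather() packs and
  // unpacks the variable-size IndexSet objects for us.
  return Utilities::MPI::all_gather(mpi_communicator, locally_owned_dofs);
}

This is only a sketch; the early return for a pre-filled locally_owned_dofs_per_processor cache is dropped on purpose, since the point of the change is to stop storing that vector.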

@peterrum
Member Author

@kronbichler After a quick look at the implementation of Utilities::MPI::all_gather(), I am quite sure it also works for IndexSet: it goes through Boost serialization and can handle variable-size quantities.

See:

template <typename T>
std::vector<T>
all_gather(const MPI_Comm &comm, const T &object)
{
#  ifndef DEAL_II_WITH_MPI
  (void)comm;
  std::vector<T> v(1, object);
  return v;
#  else
  const auto n_procs = dealii::Utilities::MPI::n_mpi_processes(comm);

  std::vector<char> buffer = Utilities::pack(object);

  int n_local_data = buffer.size();

  // Vector to store the size of loc_data_array for every process
  std::vector<int> size_all_data(n_procs, 0);

  // Exchanging the size of each buffer
  MPI_Allgather(
    &n_local_data, 1, MPI_INT, size_all_data.data(), 1, MPI_INT, comm);

  // Now computing the displacement, relative to recvbuf,
  // at which to store the incoming buffer
  std::vector<int> rdispls(n_procs);
  rdispls[0] = 0;
  for (unsigned int i = 1; i < n_procs; ++i)
    rdispls[i] = rdispls[i - 1] + size_all_data[i - 1];

  // Step 3: exchange the buffer:
  std::vector<char> received_unrolled_buffer(rdispls.back() +
                                             size_all_data.back());

  MPI_Allgatherv(buffer.data(),
                 n_local_data,
                 MPI_CHAR,
                 received_unrolled_buffer.data(),
                 size_all_data.data(),
                 rdispls.data(),
                 MPI_CHAR,
                 comm);

  std::vector<T> received_objects(n_procs);
  for (unsigned int i = 0; i < n_procs; ++i)
    {
      std::vector<char> local_buffer(received_unrolled_buffer.begin() +
                                       rdispls[i],
                                     received_unrolled_buffer.begin() +
                                       rdispls[i] + size_all_data[i]);
      received_objects[i] = Utilities::unpack<T>(local_buffer);
    }

  return received_objects;
#  endif
}
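
For reference, a short usage sketch gathering an IndexSet with this function (dof_handler and mpi_communicator are placeholder names):

// each process contributes its locally owned IndexSet; every process
// receives a vector with one IndexSet per MPI rank
const IndexSet              my_dofs = dof_handler.locally_owned_dofs();
const std::vector<IndexSet> owned_dofs_per_proc =
  Utilities::MPI::all_gather(mpi_communicator, my_dofs);
AssertDimension(owned_dofs_per_proc.size(),
                Utilities::MPI::n_mpi_processes(mpi_communicator));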

@peterrum
Member Author

peterrum commented Apr 25, 2020

Should I include compute_n_locally_owned_dofs_per_processor() and compute_locally_owned_dofs_per_processor() in this PR as well?

@kronbichler
Member

kronbichler commented Apr 25, 2020

They are all of the same kind, so I would say we either remove all or we remove none. I would vote for removing them all.

@peterrum
Member Author

@kronbichler Do you mean something like this: c41b635? Currently 57 tests fail (they use one of the compute_ functions). Before I tackle these and modify the hp::DoFHandler in the same way, I would like to know whether this is what you had in mind.


@kronbichler (Member) left a comment


This looks almost exactly like what I had in mind, except that I would purge the functionality from NumberCache and simply let the DoFHandler compute that info. The function is deprecated, and I think it would help the later clean-up if no other class/struct were involved any more.

Comment on lines 1447 to 1467
      if (number_cache.n_locally_owned_dofs_per_processor.empty() &&
          number_cache.n_global_dofs > 0)
        {
          MPI_Comm comm;

          const parallel::TriangulationBase<dim, spacedim> *tr =
            (dynamic_cast<const parallel::TriangulationBase<dim, spacedim> *>(
              &this->get_triangulation()));
          if (tr != nullptr)
            comm = tr->get_communicator();
          else
            comm = MPI_COMM_SELF;

          const_cast<dealii::internal::DoFHandlerImplementation::NumberCache &>(
            number_cache)
            .n_locally_owned_dofs_per_processor =
              compute_n_locally_owned_dofs_per_processor();
              number_cache.get_n_locally_owned_dofs_per_processor(comm);
        }
      return number_cache.n_locally_owned_dofs_per_processor;

For this implementation, can't we bypass this code and simply return Utilities::MPI::all_gather(...)?
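
A sketch of what such a bypass might look like (the communicator logic mirrors the snippet above; this ignores for the moment that the deprecated function returns a reference to a cached vector, which is exactly the complication discussed further down):

// hypothetical bypass: gather the per-process counts on the fly instead of
// caching them in the NumberCache
const parallel::TriangulationBase<dim, spacedim> *tr =
  dynamic_cast<const parallel::TriangulationBase<dim, spacedim> *>(
    &this->get_triangulation());
const MPI_Comm comm =
  (tr != nullptr) ? tr->get_communicator() : MPI_COMM_SELF;

return Utilities::MPI::all_gather(comm, this->n_locally_owned_dofs());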

Comment on lines +459 to +460
          Utilities::MPI::all_gather(mpi_communicator,
                                     dof_handler.locally_owned_dofs()),

This is unrelated to this patch, but I believe we should replace this argument by simply dof_handler.locally_owned_dofs() - I must have forgotten those in #9710.

          const_cast<dealii::internal::DoFHandlerImplementation::NumberCache &>(
            number_cache)
            .locally_owned_dofs_per_processor =
              compute_locally_owned_dofs_per_processor();
              number_cache.get_locally_owned_dofs_per_processor(comm);

Same here.

          const_cast<dealii::internal::DoFHandlerImplementation::NumberCache &>(
            mg_number_cache[level])
            .locally_owned_dofs_per_processor =
              compute_locally_owned_mg_dofs_per_processor(level);
              mg_number_cache[level].get_locally_owned_dofs_per_processor(comm);
        }
      return mg_number_cache[level].locally_owned_dofs_per_processor;

and here

#endif
      return locally_owned_dofs_per_processor;
      return Utilities::MPI::all_gather(mpi_communicator,
                                        locally_owned_dofs);

Here you use this function, but I think we can remove it altogether; this class is mostly internal anyway, I think.

@peterrum
Member Author

@kronbichler Thanks for the fast feedback.

For this implementation, can't we bypass this code and simply return Utilities::MPI::all_gather(...)?

Remember why these functions caused us headaches half a year ago: 1) the memory consumption scales with the number of processes, and 2) the function does not return a vector but a reference to a vector. The latter is the reason why we fill the vector only when needed.

I would remove the vector once the deprecated functions are deleted, i.e., in two weeks?

@kronbichler
Member

Sure, the vector can only be removed once we have eliminated the deprecated function. The NumberCache could already be cleaned up in terms of the unnecessary function, but on second thought I agree that it probably makes no big difference, since the vector lives in NumberCache.

@peterrum added this to the Release 9.2 milestone on Apr 27, 2020
@masterleinad
Member

/home/runner/work/dealii/dealii/source/dofs/number_cache.cc:90:26: error: unused variable ‘n_procs’ [-Werror=unused-variable]
       const unsigned int n_procs =
                          ^~~~~~~

@peterrum force-pushed the compute_n_locally_owned_active_cells_per_processor_remove branch from c41b635 to 14c3053 on April 27, 2020 at 22:28
@peterrum force-pushed the compute_n_locally_owned_active_cells_per_processor_remove branch from 14c3053 to 78bdbb1 on April 28, 2020 at 05:44