Failure when building Trilinos matrices with more than 2B entries #13428

Closed
bangerth opened this issue Feb 21, 2022 · 12 comments · Fixed by #13445 or #13452

@bangerth
Member

bangerth commented Feb 21, 2022

@marcfehling ran into a problem where he has a matrix with 15,110,817 rows and columns, and 2,280,732,000 entries (just barely more than the 2,147,483,647 that form the upper end of what a signed int can represent). This then leads to a segfault as follows:

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1554af260000)
 4  /expanse/lustre/projects/ucd150/mfehling/bin/aocc/trilinos-13.0.1/lib/libepetra.so.13(Epetra_CrsGraph::OptimizeStorage()+0x4e7) [0x155543905307]
 5  /home/mfehling/project/bin/aocc/dealii-10.0.0-pre/lib/libdeal_II.so.10.0.0-pre(dealii::TrilinosWrappers::SparsityPattern::compress()+0x110) [0x155550e8ec30]
 6  /home/mfehling/project/bin/aocc/dealii-10.0.0-pre/lib/libdeal_II.so.10.0.0-pre(dealii::BlockSparsityPatternBase<dealii::TrilinosWrappers::SparsityPattern>::compress()+0x6d) [0x155550cfb05d]
 7  /expanse/lustre/projects/ucd150/mfehling/build-parallel-matrixfree-hp/stokes(Problem::Stokes<3, Trilinos, 3>::setup_system()+0xe98) [0x43a9f8]
 8  /expanse/lustre/projects/ucd150/mfehling/build-parallel-matrixfree-hp/stokes(Problem::Stokes<3, Trilinos, 3>::run()+0x45f) [0x4390cf]
 9  /expanse/lustre/projects/ucd150/mfehling/build-parallel-matrixfree-hp/stokes(main+0x2a1) [0x33cbd1]
10  /lib64/libc.so.6(__libc_start_main+0xf3) [0x15553f7f5873]
11  /expanse/lustre/projects/ucd150/mfehling/build-parallel-matrixfree-hp/stokes(_start+0x2e) [0x33c86e]

He reports that this happens both when compiling in 32-bit and in 64-bit mode.

We should take a look at whether we call one of Trilinos's 32-bit functions by accident where there is a 64-bit function we should be calling instead.
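For reference, here is a minimal standalone sketch (my own illustration using the numbers from the report above, not deal.II or Trilinos code) of what happens when an entry count of this size ends up in a 32-bit signed int:

#include <climits>
#include <cstdint>
#include <iostream>

int main()
{
  // Numbers from the report above: ~15.1M rows with ~151 entries per row.
  const std::int64_t n_entries = 2280732000LL;

  std::cout << "INT_MAX:       " << INT_MAX << '\n';   // 2147483647
  std::cout << "n_entries:     " << n_entries << '\n'; // 2280732000

  // Storing the count in a 32-bit signed int wraps around to a negative
  // value -- exactly the kind of bogus size that can lead to a segfault
  // further down the line.
  const int as_int32 = static_cast<int>(n_entries);
  std::cout << "stored in int: " << as_int32 << '\n';  // -2014235296
}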

@marcfehling
Member

I triggered the segfault with a single MPI process.

The program gets past the setup_system part with more than four processes. With two processes I get an unknown exception, which may or may not be related to this segfault. I encountered this issue while generating strong scaling results.

@bangerth
Member Author

bangerth commented Feb 21, 2022 via email

@bangerth
Member Author

The following test fails; it turns out that Trilinos throws an integer exception with error code -99:

#include <deal.II/base/mpi.h>
#include <deal.II/lac/trilinos_sparsity_pattern.h>

#include "../tests.h" // deal.II test-suite harness: provides initlog(), deallog, testing_max_num_threads()

void
test()
{
  TrilinosWrappers::SparsityPattern sp;
  const unsigned int                N = 10000000; // ten million
  const unsigned int                entries_per_row =
    220; // so that we just get over the 2.1B limit for MAX_INT

  sp.reinit(N, N, entries_per_row);
  deallog << "SP::is_compressed(): " << sp.is_compressed() << std::endl;

  for (unsigned int i = 0; i < N; ++i)
    for (unsigned int j = 0; j < entries_per_row; ++j)
      sp.add(i, j);

  deallog << "SP::is_compressed(): " << sp.is_compressed() << std::endl;

  try
    {
      sp.compress();
    }

  catch (const int i)
    {
      deallog << "sp.compress() ended with an integer exception with error code "
              << i
              << std::endl;
      return;
    }

  deallog << "SP::is_compressed(): " << sp.is_compressed() << std::endl;
  deallog << "Number of entries: " << sp.n_nonzero_elements() << std::endl;
  deallog << "Number of rows: " << sp.n_rows() << std::endl;
  deallog << "Number of columns: " << sp.n_cols() << std::endl;
  deallog << "Local size: " << sp.local_size() << std::endl;
  deallog << "Max row length: " << sp.max_entries_per_row() << std::endl;
  deallog << "SP::row_length(0): " << sp.row_length(0) << std::endl;
  deallog << "Bandwidth: " << sp.bandwidth() << std::endl;
  deallog << "SP::empty(): " << sp.empty() << std::endl;

  deallog << "OK" << std::endl;
}



int
main(int argc, char **argv)
{
  initlog();

  Utilities::MPI::MPI_InitFinalize mpi_initialization(
    argc, argv, testing_max_num_threads());

  try
    {
      test();
    }
  catch (std::exception &exc)
    {
      deallog << std::endl
              << std::endl
              << "----------------------------------------------------"
              << std::endl;
      deallog << "Exception on processing: " << std::endl
              << exc.what() << std::endl
              << "Aborting!" << std::endl
              << "----------------------------------------------------"
              << std::endl;

      return 1;
    }
  catch (...)
    {
      deallog << std::endl
              << std::endl
              << "----------------------------------------------------"
              << std::endl;
      deallog << "Unknown exception!" << std::endl
              << "Aborting!" << std::endl
              << "----------------------------------------------------"
              << std::endl;
      return 1;
    };
}

The test requires about 9 GB of memory and runs for around 4 minutes on my main machine.

@jpthiele
Contributor

I might be completely mistaken, but isn't the local ordinal type of Epetra hard-coded to int?
That would explain why it is fine on 4 processes, as all local indices then fit into an int, with the maximum somewhere around 0.52B local entries per process
instead of the 2.1B local entries on a single process.

@Rombur
Member

Rombur commented Feb 23, 2022

I might be completely mistaken, but isn't the local ordinal type of Epetra hard-coded to int?

Yes, that's correct. You cannot have more than 2B entries per MPI rank.
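A rough back-of-the-envelope check with the numbers from this report, assuming entries are split roughly evenly across ranks (which would be consistent with four or more processes getting past setup_system()):

#include <climits>
#include <cstdint>
#include <iostream>

int main()
{
  const std::int64_t n_entries = 2280732000LL; // total entries in the report
  const int          ranks[]   = {1, 2, 4, 8};

  for (const int n_ranks : ranks)
    {
      const std::int64_t per_rank = n_entries / n_ranks;
      std::cout << n_ranks << " rank(s): ~" << per_rank << " local entries -> "
                << (per_rank <= INT_MAX ? "fits in" : "overflows")
                << " a signed 32-bit int\n";
    }
}

With an uneven distribution the per-rank count can of course be higher, which might explain the unknown exception seen with two processes.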

@bangerth
Member Author

Then that's something we should check on our side :-)

@jpthiele
Contributor

There is already a ToDo in the corresponding case that might be related.

@jpthiele
Contributor

I'm not sure which point would be best for adding the check/assert, but apart from that it might be a quick addition.
My idea for an AssertThrow ExcMessage would be something like
"The TrilinosWrappers use Epetra internally, which uses signed int for local indices.
Therefore, a run with more than about 2.1B nonzero matrix entries per MPI process is not possible.
If possible, use more MPI processes."
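A minimal sketch of what such a guard could look like (the helper function and its argument are hypothetical; in deal.II itself the check would live somewhere in the Trilinos wrappers, and the actual patch may differ):

#include <deal.II/base/exceptions.h>

#include <cstdint>
#include <limits>

// Hypothetical helper; AssertThrow and ExcMessage are deal.II's standard
// error-reporting facilities.
void check_local_entry_count(const std::uint64_t n_local_nonzero)
{
  AssertThrow(n_local_nonzero <=
                static_cast<std::uint64_t>(std::numeric_limits<int>::max()),
              dealii::ExcMessage(
                "The TrilinosWrappers use Epetra internally, and Epetra uses "
                "a signed int for local indices. You can therefore not have "
                "more than about 2.1 billion nonzero matrix entries per MPI "
                "process. If possible, use more MPI processes."));
}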

@bangerth
Member Author

As pointed out in #13445, the real application doesn't actually go through the paths checked there. Rather, when we call make_sparsity_pattern() on Trilinos sparsity patterns, we treat the sparsity pattern as "dynamic", and add entries via this function:

  template <typename ForwardIterator>
  inline void
  SparsityPattern::add_entries(const size_type row,
                               ForwardIterator begin,
                               ForwardIterator end,
                               const bool /*indices_are_sorted*/)
  {
    if (begin == end)
      return;

    // verify that the size of the data type Trilinos expects matches that of
    // the type the iterator points to. we allow for some slippage between
    // signed and unsigned and only compare that they are both either 32 or
    // 64 bit. to write this test properly, note that we cannot compare the
    // size of '*begin' because 'begin' may be an iterator and '*begin' may be
    // an accessor class. consequently, we need to somehow get an actual value
    // from it, which we can do by evaluating an expression such as
    // multiplying the value by 2
    Assert(sizeof(TrilinosWrappers::types::int_type) == sizeof((*begin) * 2),
           ExcNotImplemented());

    TrilinosWrappers::types::int_type *col_index_ptr =
      reinterpret_cast<TrilinosWrappers::types::int_type *>(
        const_cast<typename std::decay<decltype(*begin)>::type *>(&*begin));
    // Check at least for the first index that the conversion actually works
    AssertDimension(*col_index_ptr, *begin);
    TrilinosWrappers::types::int_type trilinos_row_index = row;
    const int                         n_cols = static_cast<int>(end - begin);

    int ierr;
    if (row_is_stored_locally(row))
      ierr =
        graph->InsertGlobalIndices(trilinos_row_index, n_cols, col_index_ptr);
    else if (nonlocal_graph.get() != nullptr)
      {
        // this is the case when we have explicitly set the off-processor rows
        // and want to create a separate matrix object for them (to retain
        // thread-safety)
        Assert(nonlocal_graph->RowMap().LID(
                 static_cast<TrilinosWrappers::types::int_type>(row)) != -1,
               ExcMessage("Attempted to write into off-processor matrix row "
                          "that has not been specified as being writable upon "
                          "initialization"));
        ierr = nonlocal_graph->InsertGlobalIndices(trilinos_row_index,
                                                   n_cols,
                                                   col_index_ptr);
      }
    else
      ierr = graph->InsertGlobalIndices(1,
                                        &trilinos_row_index,
                                        n_cols,
                                        col_index_ptr);

    AssertThrow(ierr >= 0, ExcTrilinosError(ierr));
  }

The problem is that I can't easily check here whether we overflow the allowed number of entries in the sparsity pattern, because we may be adding entries that are already in the sparsity pattern -- not every entry so added is a new one. What's worse, the error actually only happens at a later time, namely in the call to Epetra_CrsGraph::OptimizeStorage(), which I take to mean that the calls above copy new entries into some sort of cache that is only compressed when calling Epetra_CrsGraph::OptimizeStorage(). So I don't see a good place to check whether we're overflowing int.

What I should probably do is wrap the call of Epetra_CrsGraph::OptimizeStorage() into a try-catch block and ensure that we get a reasonable error message if there is an exception. Does that sound reasonable?
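For the record, a minimal sketch of that idea (the actual patch may well look different), where graph is the Epetra graph object used in the snippet above:

  // inside TrilinosWrappers::SparsityPattern::compress():
  try
    {
      const int ierr = graph->OptimizeStorage();
      AssertThrow(ierr == 0, ExcTrilinosError(ierr));
    }
  catch (const int error_code)
    {
      AssertThrow(false,
                  ExcMessage(
                    "Epetra_CrsGraph::OptimizeStorage() threw an integer "
                    "exception with error code " +
                    std::to_string(error_code) +
                    ". This typically indicates that the sparsity pattern "
                    "has more nonzero entries on this MPI process than a "
                    "signed 32-bit int can represent."));
    }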

@jpthiele
Contributor

jpthiele commented Feb 25, 2022

From what I read in the Epetra_CrsGraph docs and code, your assumption about the buffer is correct.
I think there is either an error or a relatively easy fix in FillComplete():
when computing the local index via int LID = Epetra_BlockMap::LID(gid), the code only checks for the error code -1.
It could instead check (LID < 0) at about the same cost and return an error at that point.
Edit: I opened a Trilinos issue to ask whether there is any reason not to check this instead.
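A small self-contained illustration of that suggestion; epetra_like_lid() below is a stand-in I made up, not the actual Epetra_BlockMap::LID() implementation:

#include <iostream>

// Stand-in for Epetra_BlockMap::LID(): it returns an int, with -1 as the
// documented "not found" value, but once an internal 32-bit counter has
// overflowed, other negative values can show up as well.
int epetra_like_lid(const long long gid)
{
  return static_cast<int>(gid); // 2280732000 wraps to a negative value
}

int check_row(const long long gid)
{
  const int LID = epetra_like_lid(gid);

  // current check:   if (LID == -1) return -1;
  // suggested check, about the same cost, also catches overflow-induced
  // negative values:
  if (LID < 0)
    return -1;

  return LID;
}

int main()
{
  std::cout << check_row(2280732000LL) << '\n'; // prints -1 with the new check
}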

@bangerth
Member Author

I've got some more explanations in #13452. Fundamentally, there are two Trilinos issues:

  • The addition of nonzero entries inside Epetra_CrsGraph overflows CrsGraphData_->NumMyNonzeros_, at which point it becomes a negative number.
  • The Epetra_CrsGraph::OptimizeStorage() function should return error codes as regular return values, but instead chooses to throw an exception. That's just poor design.

@marcfehling
Member

I agree with @bangerth that we should merge both patches #13445 and #13452.

Are there other opinions on the topic?
