Failure when building Trilinos matrices with more than 2B entries #13428
I triggered the segfault with a single MPI process.
On 2/20/22 21:06, Marc Fehling wrote:
> The program gets past the |setup_system| part with more than four
> processes. I get an |unknown exception| when running on two processes,
> which may or may not be related to this segfault. I encountered this issue
> when generating strong scaling results.
Trilinos occasionally indicates errors by just throwing bare |int| variables.
We tend to not be very graceful with those and report them as |unknown exception|.
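To illustrate the point about bare integers, here is a minimal C++ sketch of how a thrown `int` can be caught and reported instead of falling through to a generic handler. The function names `hypothetical_library_call` and `describe_failure` are made up for this example. They are not part of Trilinos or deal.II, and this is not how either library handles errors today.

```cpp
#include <cassert>
#include <exception>
#include <string>

// Hypothetical stand-in for a library call that signals failure by
// throwing a bare integer error code (an assumption for illustration).
inline void hypothetical_library_call(const int error_code)
{
  if (error_code != 0)
    throw error_code;
}

// Run the call and translate whatever it throws into a readable message.
inline std::string describe_failure(const int error_code)
{
  try
    {
      hypothetical_library_call(error_code);
      return "no error";
    }
  catch (const std::exception &exc)
    {
      return std::string("std::exception: ") + exc.what();
    }
  catch (const int code)
    {
      // Without this handler, the integer would end up in the
      // catch (...) block below and lose its error code.
      return "integer error code " + std::to_string(code);
    }
  catch (...)
    {
      return "unknown exception";
    }
}
```

Calling `describe_failure(-99)` returns a message that contains the code, whereas a handler chain without the `catch (const int)` clause could only report an unknown exception.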
The following test fails after we find that Trilinos throws an integer error with code
The test requires about 9 GB of memory and runs for around 4 minutes on my main machine.
I might be completely mistaken, but isn't the local ordinal type of Epetra hard-coded to int?
Yes, that's correct. You cannot have more than 2B entries per MPI rank.
Then that's something we should check on our side :-)
There is already a ToDo in the corresponding case that might be related.
I'm not sure which point would be the best to add the check/assert, but apart from that it might be a quick addition.
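A check of the kind discussed here could look roughly like the following C++ sketch. It assumes the number of locally owned entries is available as a 64-bit count before it is handed to an interface that stores it in an `int`. The helper names `fits_in_int` and `checked_local_entry_count` are hypothetical and do not exist in deal.II or Trilinos.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Returns true if a 64-bit count can be represented by an int.
inline bool fits_in_int(const std::uint64_t n)
{
  return n <= static_cast<std::uint64_t>(std::numeric_limits<int>::max());
}

// Hypothetical helper: narrow a per-rank entry count to int, or fail
// with a descriptive error instead of silently overflowing.
inline int checked_local_entry_count(const std::uint64_t n_local_entries)
{
  if (!fits_in_int(n_local_entries))
    throw std::overflow_error(
      "The number of locally owned entries (" +
      std::to_string(n_local_entries) +
      ") exceeds what an int can represent (" +
      std::to_string(std::numeric_limits<int>::max()) + ").");
  return static_cast<int>(n_local_entries);
}
```

Such a check only helps where the final number of entries is already known. Where exactly it would belong in the library is the open question raised above.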
As pointed out in #13445, the real application doesn't actually go through the paths checked there. Rather, when we call
The problem is that I can't easily check here whether we overflow the allowed number of entries in a sparsity pattern because we may be adding entries that are already in the sparsity pattern -- not every entry so added is a new one. What's worse, the error actually only happens at a later time, namely the call to
What I should probably do is wrap the call of
From what I read in the Epetra_CrsGraph docs and code, your assumption about the buffer is correct.
I've got some more explanations in #13452. Fundamentally, there are two Trilinos issues:
@marcfehling ran into a problem where he has a matrix with 15,110,817 rows and columns, and 2,280,732,000 entries (just barely more than the 2,147,483,647 that form the upper end of what a signed int can represent). This then leads to a segfault as follows:
He reports that this happens when compiling in both 32-bit and 64-bit mode.
We should take a look at whether we call one of Trilinos's 32-bit functions by accident where there is a 64-bit function we should be calling instead.
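The numbers quoted in the report can be checked with a few lines of C++. This is only an arithmetic sketch: the constants are taken from the description above, and the function names are invented for the example. The wrapped value shows what a two's complement truncation to 32 bits would yield. It is not a claim about what Trilinos computes internally.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Figures from the report above.
constexpr std::int64_t n_rows    = 15110817;
constexpr std::int64_t n_entries = 2280732000;

// Largest value a signed 32-bit integer can hold: 2,147,483,647.
constexpr std::int64_t int32_max = std::numeric_limits<std::int32_t>::max();

// By how much the entry count exceeds the 32-bit limit.
constexpr std::int64_t excess_over_int32_max()
{
  return n_entries - int32_max;
}

// Value the count would take if truncated to 32 bits in two's
// complement, computed in 64-bit arithmetic to avoid any narrowing cast.
constexpr std::int64_t wrapped_to_32_bits()
{
  return n_entries - (std::int64_t{1} << 32);
}

// Average number of entries per row, rounded down.
constexpr std::int64_t average_entries_per_row()
{
  return n_entries / n_rows;
}

static_assert(n_entries > int32_max,
              "the entry count does not fit into a signed 32-bit integer");
```

The count exceeds the limit by 133,248,353 entries, a 32-bit wraparound would turn it into -2,014,235,296, and the matrix has about 150 entries per row on average.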