Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recoverable SNES and TS #15276

Merged
merged 1 commit into from Jun 15, 2023

Conversation

stefanozampini
Copy link
Contributor

depends on #15271 #15269

@stefanozampini
Copy link
Contributor Author

@bangerth This should also go the release.

@bangerth bangerth added this to the Release 9.5 milestone May 27, 2023
Copy link
Member

@bangerth bangerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like too complicated a scheme, because you have to first call call_and_possibly_capture_exception() and then re-interpret what that function did. What if you added an extra argument to that function, say const std::function<bool()> &deal_with_recoverable_error, that is called in the catch (const RecoverableUserCallbackError &exc) case. If that function returns true, then the exception is simply absorbed and we return -1; if the function returns false, then we save the exception and return -1. This way, for each callback where there is a specific way to signal to PETSc that an error is recoverable, we can do that in the function object and return true; for each place where we can't, we do nothing but return false.

include/deal.II/lac/petsc_snes.templates.h Outdated Show resolved Hide resolved
include/deal.II/lac/petsc_snes.templates.h Outdated Show resolved Hide resolved
tests/petsc/petsc_snes_05.cc Show resolved Hide resolved
@bangerth
Copy link
Member

bangerth commented Jun 4, 2023

Would you mind rebasing this on current master to get rid of the already-merged commits?

@stefanozampini
Copy link
Contributor Author

Would you mind rebasing this on current master to get rid of the already-merged commits?

done

Copy link
Member

@bangerth bangerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch has three parts, two of which are fine and one which I disagree with. Can you submit them separately?

  • The part where you change PetscFunctionReturn(err) to PetscFunctionReturn(err ? PETSC_ERR_LIB : 0);. This part is fine, and I'd like to get this committed via a separate PR.
  • The addition of the compress() calls to tests. This is fine as well, same here: Separate PR.
  • The part I disagree with is the error handling. I'll post a separate proposal in a separate comment in a second.

@bangerth
Copy link
Member

bangerth commented Jun 8, 2023

About the error handling: The way you deal with this is that if you get a recoverable user callback error, you deal with this in the call_and_possibly_capture_snes_exception() function by returning -1 instead of -2:

     catch (const RecoverableUserCallbackError &)
        {
          // ok, ignore, but reset the pointer
          eptr = nullptr;
          return -1;
      }
      catch (...)
        {
          eptr = std::current_exception();
          return -1;
          return -2;
        }

The problem is that you have to then deal with this at all calling locations, but you really only do it for the residual function. In all of the other places, right now you just eat the exception, pass the error on to PETSc, which then probably doesn't know what to do, errors out, but you have eaten the exception so you cannot re-throw it -- though it would be useful to know from a user perspective that the failure was due to a recoverable error that wasn't recovered from.

I think a better system would go as follows:

  • Change the call_and_possibly_capture_snes_exception() function to the following:
    /**
     * A function that calls the function object given by its first argument
     * with the set of arguments following at the end. If the call returns
     * regularly, the current function returns zero to indicate success. If
     * the call fails with an exception, then the current function returns with
     * an error code of -1. In that case, the exception thrown by `f` is
     * captured and `eptr` is set to the exception. In case of success,
     * `eptr` is set to `nullptr`.
     *
     * If the user callback fails with a recoverable exception, then (i) if
     * `recoverable_action` is set, execute it, eat the exception, and return
     * zero; (ii) if `recoverable_action` is an empty function object, store the
     * exception and return -1.
     */
    template <typename F, typename... Args>
    int
    call_and_possibly_capture_snes_exception(const F &           f,
                                             std::exception_ptr &eptr,
                                             const std::function<void()> &recoverable_action,
                                             Args &&...args)
    {
      // See whether there is already something in the exception pointer
      // variable. There is no reason why this should be so, and
      // we should probably bail out:
      AssertThrow(eptr == nullptr, ExcInternalError());

      // Call the function and if that succeeds, return zero:
      try
        {
          f(std::forward<Args>(args)...);
          eptr = nullptr;
          return 0;
        }
      // In case of a recoverable exception, do as suggested above:
      catch (const RecoverableUserCallbackError &)
        {
          if (recoverable_action)
          {
             recoverable_action();
             eptr = nullptr;
             return 0;
          }
          else
          {
            eptr = std::current_exception();
            return -1;
          }
        }
      // In case of an exception, capture the exception and
      // return -1:
      catch (...)
        {
          eptr = std::current_exception();
          return -1;
        }
    }
  • For callbacks for which recovery is not an option, you would then simply call call_and_possibly_capture_snes_exception() with {} for the new argument.
  • In places where recovery is an option, you would call call_and_possibly_capture_snes_exception with something like [f]() { *f = std::numeric_limits<real_type>::quiet_NaN(); } for the new argument.

My preference, though, would be to call the function Jed Brown suggested instead of explicitly setting vector elements to NaNs.

@stefanozampini
Copy link
Contributor Author

@bangerth Thanks for the suggestion about the recoverable error callback. I have implemented it and extended recovery capabilities to PETScWrappers::TimeStepper too. Please review again, this should now satisfy your requirements

Copy link
Member

@bangerth bangerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like this. I think it's quite close.

My only question is whether we should expose these recoverable_action_... callbacks to users. I don't think you use them in any of the tests (but may have missed it?) but from a software design perspective, this configuration point seems duplicative: These callbacks get called whenever a previous callback threw an exception. Whatever a user wants to do in the recoverable_action_* function is something they could have already done at the point where they raised the exception to begin with.

So my preference would be to just get rid of these function objects and only use the internal ones you already define.

@stefanozampini
Copy link
Contributor Author

Yes, I like this. I think it's quite close.

My only question is whether we should expose these recoverable_action_... callbacks to users. I don't think you use them in any of the tests (but may have missed it?) but from a software design perspective, this configuration point seems duplicative: These callbacks get called whenever a previous callback threw an exception. Whatever a user wants to do in the recoverable_action_* function is something they could have already done at the point where they raised the exception to begin with.

So my preference would be to just get rid of these function objects and only use the internal ones you already define.

Done, with proper handling of unrecoverable errors too. This is ready for final review

@stefanozampini
Copy link
Contributor Author

For example, now if you raise an unrecoverable exception, you also get a nice stack trace in PETSc

$ ./petsc_ts_05.debug 
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Error in external library
[0]PETSC ERROR: Failure in ts_ijacobian from dealii::PETScWrappers::TimeStepper
...
[0]PETSC ERROR: #1 implicit_jacobian() at /home/szampini/Devel/dealii/include/deal.II/lac/petsc_ts.templates.h:489
[0]PETSC ERROR: #2 TSComputeIJacobian() at /home/szampini/Devel/petsc/src/ts/interface/ts.c:920
[0]PETSC ERROR: #3 SNESTSFormJacobian_BDF() at /home/szampini/Devel/petsc/src/ts/impls/bdf/bdf.c:375
[0]PETSC ERROR: #4 SNESTSFormJacobian() at /home/szampini/Devel/petsc/src/ts/interface/ts.c:4262
[0]PETSC ERROR: #5 SNESComputeJacobian() at /home/szampini/Devel/petsc/src/snes/interface/snes.c:2755
[0]PETSC ERROR: #6 SNESSolve_NEWTONLS() at /home/szampini/Devel/petsc/src/snes/impls/ls/ls.c:216
[0]PETSC ERROR: #7 SNESSolve() at /home/szampini/Devel/petsc/src/snes/interface/snes.c:4631
[0]PETSC ERROR: #8 TSBDF_SNESSolve() at /home/szampini/Devel/petsc/src/ts/impls/bdf/bdf.c:209
[0]PETSC ERROR: #9 TSStep_BDF() at /home/szampini/Devel/petsc/src/ts/impls/bdf/bdf.c:269
[0]PETSC ERROR: #10 TSStep() at /home/szampini/Devel/petsc/src/ts/interface/ts.c:3449
[0]PETSC ERROR: #11 TSSolve() at /home/szampini/Devel/petsc/src/ts/interface/ts.c:3842

Copy link
Member

@bangerth bangerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments and one question.

include/deal.II/lac/petsc_ts.h Outdated Show resolved Hide resolved
include/deal.II/lac/petsc_ts.h Outdated Show resolved Hide resolved
Comment on lines +1148 to +1155
return PetscError(
PetscObjectComm((PetscObject)ppc),
lineno + 3,
"vmult",
__FILE__,
PETSC_ERR_LIB,
PETSC_ERROR_INITIAL,
"Failure in pcapply from dealii::PETScWrappers::NonlinearSolver");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct that by default, PetscError simply returns the error code (here PETSC_ERR_LIB) and does nothing else? But that one can configure things to print errors on the console with all the information you are giving as arguments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? PetscError is the entry point for error handling in PETSc, hidden in the macro calls SETERRQ, CHKERRQ (in the past) and PetscCall(...) in the most recent versions. It can start a cascade of errors (when called with PETSC_ERROR_INITIAL, used in the SETERRQ macro) or continue going up the hierarchy of calls (with PETSC_ERROR_REPEAT in the PetscCall and CHKERRQ ). Inside PetscError a few things are done, including the preparation of the line for the traceback and the invocation of the error handler, which by default is PetscTraceBackErrorHandler, which prints the stack trace above. The error handler is customizable by the user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question was mainly what happens by default when you hit PetscError. Does it automatically print a stacktrace, or does it only do that if you configure PETSc to do so (or pass an appropriate command line flag)?

I'm asking because throwing exceptions (recoverable or not) should be a totally legitimate thing to do. The "error" is simply if you don't deal with them somewhere before you are back in main(), and at that time the stacktrace should be printed. If you throw and catch an exception, no error should be printed because no error has happened: In the catch statement, you are dealing with the error, after all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default error handling is the one that prints the traceback above. We don't get the what() the message when the error is raised by the solve_with_jacobian callback because I'm eating the excpetion in PetscPreconditioner. Before over-engineering the code, we should probably decide what to do in terms of unifying the usage of AssertPETSc (i.e. throw exception, i.e. when calling PETSc in deal.II world), PetscCall (i.e. calling PETSc in PETSc callbacks like snes_function in NonlinearSolver) and, possibly, have a single and unified call_and_possibly_recover that will be shared by all the PETSc code in deal.II

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bangerth On second thought (and by looking at #15349 and #15360) I think I should handle the exception here to have a more uniform error report from the nonlinear solvers. I will push one last change before merging this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too late -- the PR is already merged :-)

@stefanozampini
Copy link
Contributor Author

@bangerth Ok, I think this is ready now

@bangerth
Copy link
Member

/rebuild

@stefanozampini stefanozampini mentioned this pull request Jun 15, 2023
@luca-heltai luca-heltai merged commit e1f4330 into dealii:master Jun 15, 2023
14 checks passed
// Failed reason is not reset uniformly within the
// interface code of PCSetUp in PETSc.
// We handle it here.
PetscCall(PCSetFailedReason(ppc, PC_NOERROR));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks our regression testers, see https://cdash.dealii.org/viewBuildError.php?buildid=1876. Looks like this function was only added in PETSc 3.14, see petsc/petsc@1b2b984 but we claim to support 3.7.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to know this before merging to main.... I'll fix it ASAP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanozampini stefanozampini deleted the stefanozampini/recoverable branch July 18, 2023 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants