Step-77: compatibility issues and no convergence #12594

Closed
lpsaavedra opened this issue Jul 23, 2021 · 18 comments · Fixed by #13900

Comments

@lpsaavedra
Contributor

Step-77 solves a nonlinear equation using the KINSOL solver of the SUNDIALS library, so deal.II must be compiled with SUNDIALS enabled. The tutorial does not work with the SUNDIALS 3.1.0 distributed with candi; in debug mode it aborts with the following error:

--------------------------------------------------------
An error occurred in line <427> of file </home/blaisb/work/dealii/dealii/source/sundials/kinsol.cc> in function
    unsigned int dealii::SUNDIALS::KINSOL<VectorType>::solve(VectorType&) [with VectorType = dealii::Vector<double>]
The violated condition was: 
    solve_jacobian_system
Additional information: 
    Please provide an implementation for the function
    "solve_jacobian_system"

Stacktrace:
-----------
#0  /home/blaisb/work/dealii/inst/lib/libdeal_II.g.so.10.0.0-pre: dealii::SUNDIALS::KINSOL<dealii::Vector<double> >::solve(dealii::Vector<double>&)
#1  ./step-77.debug: Step77::MinimalSurfaceProblem<2>::run()
#2  ./step-77.debug: main
------------------------------------------------------

In release mode, only a segfault is obtained. Using the latest version of SUNDIALS (5.7.0), the code compiles but does not converge. The following error is obtained in debug mode:

Mesh refinement step 0
  Target_tolerance: 0.001

  Computing residual vector... norm=0.231202
  Computing Jacobian matrix
  Factorizing Jacobian matrix
  Solving linear system
  Computing residual vector... norm=0.231202
  Computing residual vector... norm=0.402211
  Computing residual vector... norm=0.627191
  Computing residual vector... norm=0.857696
  Computing residual vector... norm=1.08878
  Computing residual vector... norm=1.31995
  Computing residual vector... norm=1.55112
  Computing residual vector... norm=1.78231
  Computing residual vector... norm=2.0135
  Computing residual vector... norm=2.24469
  Computing residual vector... norm=2.47588
  Computing residual vector... norm=2.70708

[KINSOL ERROR]  KINSol
  The line search algorithm was unable to find an iterate sufficiently distinct from the current iterate.


--------------------------------------------------------
An error occurred in line <518> of file </home/laura/Local/dealii/source/sundials/kinsol.cc> in function
    unsigned int dealii::SUNDIALS::KINSOL<VectorType>::solve(VectorType&) [with VectorType = dealii::Vector<double>]
The violated condition was: 
    status >= 0
Additional information: 
    One of the SUNDIALS KINSOL internal functions returned a negative
    error code: -5. Please consult SUNDIALS manual.

Stacktrace:
-----------
#0  /home/laura/Local/dealii-build-sundials/lib/libdeal_II.g.so.10.0.0-pre: dealii::SUNDIALS::KINSOL<dealii::Vector<double> >::solve(dealii::Vector<double>&)
#1  ./step-77: Step77::MinimalSurfaceProblem<2>::run()
#2  ./step-77: main
--------------------------------------------------------

Does anyone have any idea how to fix this and make the tutorial work?

FYI @blaisb @oguevremont

@blaisb
Member

blaisb commented Jul 23, 2021

@lpsaavedra I know how to allow 3.1 to run with step-77 (a function needs to be added). However, that does not change the fact that it does not converge :). Good catch.

@bangerth
Member

This is aggravating -- it used to work just fine when I wrote the program not long before the release, and must have broken in the relatively short amount of time between the merge and the release :-( We will have to find which patch broke this -- in any case, I can confirm exactly the error @lpsaavedra shows above.

If you wanted to help, do you know how to bisect a git history?
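For readers unfamiliar with bisection: the idea is a binary search over the commit history between a known-good and a known-bad revision. The sketch below is only an illustration of that search, not how git implements it; the commit names and the `step77_converges` check are hypothetical stand-ins for "build deal.II at this commit and run step-77".

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first commit for which is_good() is False.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad too.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical history: ten commits, "c5" introduces the regression.
history = [f"c{i}" for i in range(10)]
tested = []


def step77_converges(commit: str) -> bool:
    # Stand-in for "check out, build, run step-77, see whether it converges".
    tested.append(commit)
    return int(commit[1:]) < 5


culprit = first_bad_commit(history, step77_converges)
print(culprit, "found after", len(tested), "builds")
```

In practice `git bisect start`, `git bisect good <rev>` and `git bisect bad <rev>` drive the same search and pick the next commit to test for you.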

@bangerth bangerth added this to the Release 10.0 milestone Jul 29, 2021
@bangerth bangerth added the Bug label Jul 29, 2021
@bangerth
Member

I would specifically see if #12254 introduced the problem.

@lpsaavedra
Contributor Author

> If you wanted to help, do you know how to bisect a git history?

Yes, I will do it and let you know how it goes. Hopefully we find the patch soon.

@bangerth
Member

Awesome! The number of patches between when #11953 was merged and the 9.3 release is not huge, so hopefully you'll find it soon!

@lpsaavedra
Contributor Author

#12216 was the one that introduced the problem

@peterrum
Member

@lpsaavedra Maybe you could also take a look at the PRs referenced in issue #12223.

@blaisb
Member

blaisb commented Aug 4, 2021

@peterrum, @luca-heltai, @bangerth What do you think is the best way forward? @lpsaavedra has pinpointed the commit. I'd like to help her fix this, but I have to admit that this part of the KINSOL wrappers will take me a long time to understand. How would you like to approach this? It's a bit sad that step-77 is not working right now. Additionally, the KINSOL wrapper implementation right now seems fragile, yet we'd like to transition to using KINSOL as a nonlinear solver in Lethe. If I can do anything to help fix this, please tell me. :)

@blaisb
Member

blaisb commented Sep 14, 2021

Bumpity Bump :). I tried to look into this from numerous angles, but I still don't understand what's going on :(

@luca-heltai
Member

So the error that KINSOL is throwing seems to be KIN_LINESEARCH_NONCONV (-5), that is, "The line search algorithm was unable to find an iterate sufficiently distinct from the current iterate." This looks like it could be a setting that is not correct.

My guess is that in the earlier versions of the kinsol wrapper, some of the settings were not read correctly (we were, in fact, initializing KINSOL in the wrong manner). Maybe we are using the wrong tolerances somewhere, or we are not passing them correctly to KINSOL. I'll try to explore a bit.
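To make the failure mode concrete, here is a minimal scalar sketch of a Newton iteration with a backtracking line search. It is not KINSOL's algorithm and uses none of its API; `newton_line_search`, `min_step` and the deliberately wrong Jacobian are assumptions made for illustration. It only shows that when every trial step makes the residual worse, the step length shrinks until the new iterate is no longer meaningfully distinct from the current one, whatever the underlying cause (a wrong setting, a wrong Jacobian, or a bug in the vector operations).

```python
class LineSearchFailure(RuntimeError):
    """No acceptable step length was found (similar in spirit to error -5)."""


def newton_line_search(f, jac, x, tol=1e-10, max_iter=50, min_step=1e-12):
    """Scalar Newton iteration with a simple backtracking line search."""
    for _ in range(max_iter):
        r = f(x)
        if abs(r) < tol:
            return x
        d = -r / jac(x)            # Newton direction
        lam = 1.0                  # try the full Newton step first
        while abs(f(x + lam * d)) >= abs(r):
            lam *= 0.5             # residual did not decrease: backtrack
            if lam * abs(d) < min_step:
                raise LineSearchFailure("step too small to make progress")
        x += lam * d
    raise RuntimeError("too many iterations")


f = lambda x: x * x - 2.0
good_jac = lambda x: 2.0 * x
bad_jac = lambda x: -2.0 * x       # deliberately wrong sign, for illustration only

root = newton_line_search(f, good_jac, 1.0)

try:
    newton_line_search(f, bad_jac, 1.0)
    failed = False
except LineSearchFailure:
    failed = True

print(root, failed)
```

With the correct derivative the iteration accepts full steps and converges; with the wrong one every backtracked trial point still has a larger residual, so the search gives up.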

@blaisb
Member

blaisb commented Sep 15, 2021

> So the error that KINSOL is throwing seems to be KIN_LINESEARCH_NONCONV (-5), that is, "The line search algorithm was unable to find an iterate sufficiently distinct from the current iterate." This looks like it could be a setting that is not correct.
>
> My guess is that in the earlier versions of the kinsol wrapper, some of the settings were not read correctly (we were, in fact, initializing KINSOL in the wrong manner). Maybe we are using the wrong tolerances somewhere, or we are not passing them correctly to KINSOL. I'll try to explore a bit.

That makes sense. I have found the behavior to be very erratic right now, so clearly some setting is not set correctly (or maybe not even initialized). @lpsaavedra and I implemented KINSOL in Lethe. For some cases, the behavior of the first iteration is identical to a classical Newton's method, which is what I would expect (a full Newton step is taken), but for other cases, KINSOL is unable to proceed past the first iteration, even though a classical Newton method works perfectly well in that case. It is very erratic behavior :). Don't hesitate to ping me if you need help with something.

@bangerth
Member

Since we know which patch broke step-77, is there a way to undo individual parts of the patch to see what part broke the functionality?

@blaisb
Member

blaisb commented Sep 22, 2021

> Since we know which patch broke step-77, is there a way to undo individual parts of the patch to see what part broke the functionality?

That could be done. I just don't have a good understanding of the KINSOL wrappers. I can take a stab at it, though, and see which removed element broke things. It's just that it's such a black-box wrapper that it is difficult to see what's going on.
My best bet right now is to solve problems and compare the results with what I obtain using a regular Newton method, at least for the first iteration, to see what breaks and what does not. I have not been able to find any logic regarding which cases work and which fail. I feel like this cat while debugging this:

https://tenor.com/view/qa-cat-leak-flood-broken-pipe-gif-12195496

(which is by far my favorite programming gif :) ).
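The comparison described above needs a reference to compare against. A minimal sketch of such a reference, assuming a small hypothetical 2x2 test system rather than step-77 itself, is a plain undamped Newton iteration that records the residual norm before each step; the first entries are what a correctly working solver taking a full Newton step should reproduce.

```python
import math


def residual(u):
    # Hypothetical test system: x^2 + y^2 = 4 and x = y.
    x, y = u
    return [x * x + y * y - 4.0, x - y]


def jacobian(u):
    x, y = u
    return [[2.0 * x, 2.0 * y], [1.0, -1.0]]


def solve_2x2(a, b):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def newton_history(u, steps):
    """Return the residual norm before each full (undamped) Newton step."""
    norms = []
    for _ in range(steps):
        r = residual(u)
        norms.append(norm(r))
        du = solve_2x2(jacobian(u), [-c for c in r])
        u = [u[0] + du[0], u[1] + du[1]]
    return norms, u


reference_norms, solution = newton_history([1.0, 2.0], steps=6)
for k, n in enumerate(reference_norms):
    print(f"iteration {k}: residual norm = {n:.3e}")
```

If the wrapper's logged norm after its first accepted step differs from the second entry of `reference_norms` for the same starting point, the discrepancy is in the wrapper or its settings rather than in the problem.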

@blaisb
Member

blaisb commented Oct 17, 2021

So I took a look at the parameters that we initialise and the initial values we give them. The entire list of parameters that can be used is the following:
[image: list of KINSOL parameters]

For some of our problems, I have managed to reach convergence by manually specifying a function_tolerance and a maximum_newton_step, two parameters whose defaults are determined by uround. However, this is not really robust for all problems. Notably, I have not found any combination of parameters, or anything else, that makes step-77 converge. Quite the opposite: the problem blows up quite quickly.
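As a rough illustration of what these two kinds of settings control, the sketch below caps the length of a proposed Newton step and tests a residual against a function tolerance. The helper names `cap_step` and `converged` are hypothetical, and KINSOL applies its own scaling to both quantities, so this shows the idea only, not the library's exact formulas.

```python
import math


def cap_step(step, max_norm):
    """Scale `step` down so its Euclidean norm does not exceed `max_norm`."""
    length = math.sqrt(sum(s * s for s in step))
    if length <= max_norm:
        return list(step)
    scale = max_norm / length
    return [scale * s for s in step]


def converged(residual, function_tolerance):
    """Stop when the largest residual entry is below the tolerance."""
    return max(abs(r) for r in residual) < function_tolerance


proposed = [3.0, 4.0]                # Euclidean norm 5
limited = cap_step(proposed, 1.0)    # same direction, norm capped at 1
print(limited, converged([1e-4, -2e-5], function_tolerance=1e-3))
```

A cap that is too small slows every iteration down, and a tolerance that is too tight can never be met, which is one reason hand-tuned values rarely carry over from one problem to the next.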

I have also tried to look at what was changed in #12216 and identify what could have been introduced that created this error, but this was unsuccessful.

At this point, I'll leave this in the hands of more capable people. I can't find the solution right now, even after looking through the KINSOL documentation. FYI @lpsaavedra.
@luca-heltai you might have a better idea of how to solve this, but in its current shape, the wrapper just does not work robustly, and robustness is the reason why one would want to go through KINSOL. :)

@luca-heltai
Member

I think we have a subtle error in our implementation of the NVector type. The previous wrapper was copying data from and to the native NVector type of SUNDIALS. Our current wrapper avoids the copy and implements the interface of NVector types. I suspect that somewhere in this implementation we have inserted a bug, but I have not been able, so far, to spot where it could be.
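One class of bug that a copy-based wrapper hides and a view-based wrapper exposes is aliasing: a vector operation whose output is the same object as one of its inputs. The thread does not establish that this is the actual bug; the sketch below is purely illustrative, and both `linear_sum_*` functions are hypothetical, not SUNDIALS or deal.II code.

```python
def linear_sum_naive(a, x, b, y, z):
    """z = a*x + b*y in two passes. Wrong when z is the same list as y."""
    for i in range(len(z)):
        z[i] = a * x[i]
    for i in range(len(z)):
        z[i] += b * y[i]       # reads y after z (possibly y itself) was overwritten


def linear_sum_safe(a, x, b, y, z):
    """z = a*x + b*y, entry by entry, so aliasing is harmless."""
    for i in range(len(z)):
        z[i] = a * x[i] + b * y[i]


x = [1.0, 2.0]

# Copy-style use: the output is a separate buffer, so the naive version works.
y = [10.0, 20.0]
out = [0.0, 0.0]
linear_sum_naive(2.0, x, 1.0, y, out)
copy_result = out

# View-style use: the caller asks for y = 2*x + y in place.
y = [10.0, 20.0]
linear_sum_naive(2.0, x, 1.0, y, y)
alias_naive = y

y = [10.0, 20.0]
linear_sum_safe(2.0, x, 1.0, y, y)
alias_safe = y

print(copy_result, alias_naive, alias_safe)
```

A check like this, run against each vector operation with every input also used as the output, is a cheap way to rule this particular failure in or out.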

@blaisb
Member

blaisb commented Oct 18, 2021

@luca-heltai I agree with you. There is something very erratic with the behavior of the wrapper right now and I don't think it's just a parameter issue. What do you think is the best course of action? Revert back to the previous version of the wrapper with the copy until we have a fix?

@tamiko
Member

tamiko commented Nov 3, 2021

I will go ahead and revert #12216 on the dealii-9.3 release branch for the point release.

@peterrum
Member

@bangerth
