
shell_simple_3d crashes #123

Closed
bangerth opened this issue Jun 6, 2014 · 30 comments

@bangerth
Contributor

bangerth commented Jun 6, 2014

When running shell_simple_3d.prm with 32 processors, it works for 16 time steps but then I get this:

*** Timestep 15: t=7.83037e+06 years
Solving temperature system... 21 iterations.
Solving Stokes system... 30+14 iterations.

Postprocessing:
RMS, max velocity: 0.0434 m/year, 0.162 m/year
Temperature min/avg/max: 973 K, 1409 K, 1973 K
Heat fluxes through boundary parts: -7.917e+12 W, 3.016e+12 W
Writing depth average output/depth_average.vtu

Number of active cells: 74,324 (on 6 levels)
Number of degrees of freedom: 3,053,696 (2,213,988+101,712+737,996)

*** Timestep 16: t=8.35066e+06 years
Solving temperature system... 19 iterations.
Rebuilding Stokes preconditioner...
Solving Stokes system... 30+19 iterations.

Postprocessing:
Writing graphical output: output/solution-00008
RMS, max velocity: 0.00439 m/year, 0.0298 m/year
Temperature min/avg/max: 973 K, 1408 K, 1973 K
Heat fluxes through boundary parts: -7.68e+12 W, 2.922e+12 W

*** Timestep 17: t=1.00836e+07 years
Solving temperature system...


An error occurred in line <828> of file </u/bangerth/p/deal.II/3/install-bottom/include/deal.II/lac/solver_gmres.h> in function
void dealii::SolverGMRES::solve(const MATRIX&, VECTOR&, const VECTOR&, const PRECONDITIONER&) [with MATRIX = dealii::TrilinosWrappers::SparseMatrix, PRECONDITIONER = dealii::TrilinosWrappers::PreconditionILU, VECTOR = dealii::TrilinosWrappers::MPI::Vector]
The violated condition was:
false
The name and call sequence of the exception was:
SolverControl::NoConvergence (this->control().last_step(), this->control().last_value())
Additional Information:

Iterative method reported convergence failure in step 737996 with residual 3.30962e+26

Does anyone else see this? I'm running with
mpirun -np 32 aspect ../Aspect/cookbooks/shell_simple_3d.prm

@ian-r-rose
Contributor

I notice that the CFL number is rather large (1.0) for this parameter file, and that the crash happens on the second time step after a refinement. Is it possible that the bdf2 scheme is losing stability? That is to say, could the old_solution and the old_old_solution be sufficiently different across refinement levels that the quadratic extrapolation is not well behaved?

Did this cookbook work previously? A test of this theory would be (1) refining after, say, 5 timesteps instead of 15, or (2) turning off bdf2 and just using implicit Euler.
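For reference, a sketch of the extrapolation being discussed, assuming the step-31/32-style formula typically used for the two old solutions (the exact form in ASPECT may differ). With time step sizes k_n and k_{n-1},

    T^* = (1 + k_n / k_{n-1}) T^{n-1} - (k_n / k_{n-1}) T^{n-2},

where T^{n-1} and T^{n-2} correspond to old_solution and old_old_solution. With equal step sizes this reduces to T^* = 2 T^{n-1} - T^{n-2}, so any spurious difference between the two old fields (for example from interpolation across a refinement step) is amplified rather than damped in the extrapolated field.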

@bangerth
Contributor Author

bangerth commented Jun 6, 2014

I'm quite certain I ran the input file pretty much as is to generate the corresponding section in the manual...

@tjhei
Member

tjhei commented Jun 6, 2014

and I think we should be unconditionally stable so the CFL number should not be the problem.

This could be related to the change of the linear residual that we did.

@bangerth
Contributor Author

bangerth commented Jun 6, 2014

On 06/06/2014 01:52 PM, Timo Heister wrote:

and I think we should be unconditionally stable so the CFL number should
not be the problem.

This could be related to the change of the linear residual that we did.

Good point. I've been running this on a branch here but would appreciate it if anyone else could try to reproduce it on their end. It's going to take a very long time because of the very large number of iterations it does in the last step -- you probably want to limit that manually...

W.



@ian-r-rose
Contributor

Wasn't the change to the linear residual only in the Stokes system? It appears to me that the failure is in the solve for the temperature system.

@tjhei
Member

tjhei commented Jun 8, 2014

I can reproduce the problem here (even with 4 CPUs). Changing how often we do AMR doesn't dramatically change the result. I still see the crash with 2 instead of 3 initial adaptive steps (that doesn't make the problem much smaller, though).

@tjhei
Member

tjhei commented Jun 9, 2014

I managed to get the crash much sooner (timestep 6 with 900k DoFs):
https://github.com/tjhei/aspect/blob/shell_3d_crash/cookbooks/shell_simple_3d.prm

@ian-r-rose
Contributor

I can confirm that Timo's simplified prm file crashes on my workstation with 4 cores.

@tjhei
Member

tjhei commented Jul 8, 2014

Simplified the file some more (still timestep 6, see link in my last message). I tried removing the constraints.distribute() after refinement, but this is not the cause of the crash. The velocities look rather strange right before the crash:
[image: velocity field right before the crash]
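For context, a schematic of the deal.II refinement/solution-transfer pattern that this constraints.distribute() call belongs to -- a minimal sketch, not ASPECT's actual code; the function name and surrounding objects are assumed for illustration:

#include <deal.II/distributed/solution_transfer.h>
#include <deal.II/distributed/tria.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/dofs/dof_tools.h>
#include <deal.II/fe/fe.h>
#include <deal.II/lac/constraint_matrix.h>
#include <deal.II/lac/trilinos_vector.h>

using namespace dealii;

template <int dim>
void refine_and_transfer (parallel::distributed::Triangulation<dim> &triangulation,
                          DoFHandler<dim>                           &dof_handler,
                          const FiniteElement<dim>                  &fe,
                          ConstraintMatrix                          &constraints,
                          TrilinosWrappers::MPI::Vector             &old_solution)
{
  // register the old solution with the transfer object before refining
  parallel::distributed::SolutionTransfer<dim, TrilinosWrappers::MPI::Vector>
    transfer (dof_handler);
  triangulation.prepare_coarsening_and_refinement ();
  transfer.prepare_for_coarsening_and_refinement (old_solution);
  triangulation.execute_coarsening_and_refinement ();

  // redistribute DoFs and rebuild the hanging-node constraints on the new mesh
  // (a real parallel code would also reinit the constraints with the locally
  //  relevant index set and re-add boundary constraints)
  dof_handler.distribute_dofs (fe);
  constraints.clear ();
  DoFTools::make_hanging_node_constraints (dof_handler, constraints);
  constraints.close ();

  // interpolate the old solution onto the new mesh ...
  TrilinosWrappers::MPI::Vector interpolated (dof_handler.locally_owned_dofs (),
                                              MPI_COMM_WORLD);
  transfer.interpolate (interpolated);

  // ... and make it conforming again -- this is the call discussed above
  constraints.distribute (interpolated);
  old_solution = interpolated;
}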

@gassmoeller
Member

I filled an hour of free time with some tests concerning this problem, and while I have no solution, here are some observations and ways to circumvent the crash:

  1. The theoretical/analytical solution of shell_simple_3d.prm should be exactly the one we use to calculate the initial residual (v=0, adiabatic pressure). Nonetheless the Stokes solver seems to converge, although to non-useful velocities (this is not surprising, though, and usually just means that numerical oscillations/errors will initiate the usual convection after a while). As soon as one provides even the slightest perturbation to the initial temperature field, everything seems to work fine (for example, use the Harmonic perturbation initial condition with a perturbation magnitude of 1 K); see the sketch after this list.
  2. The crash only occurs with adaptive meshing, and always in the advection solver in the timestep after a timestep with remeshing. More precisely, with Timo's input file the crash seems to occur after the first remeshing that resolves the whole inner thermal boundary layer with resolution 2. For this it is good to know that the solution to this parameter file is somewhat different with a global resolution of 1 than with 2 (vrms around 4 cm/year vs. 0.5 cm/year), and the solution in the adaptive case jumps from the first one to the second one right in the timestep before the crash. Notice that the same jump in velocities occurs in Wolfgang's output above. One could argue that this is just a failed solution of the Stokes solver, but it looks like the same solution you get with a static global resolution of 2, and there no crash happens. Therefore it seems that the jump in the solution, combined with the problems with the initial residual calculation for exactly this model setup, is causing the trouble.
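A sketch of why observation (1) is dangerous, assuming a relative stopping criterion of the usual form for the Stokes solve (the exact criterion in the code may differ):

    ||r_k|| <= tol * ||r_0||,   with   r_0 = b - A x_0.

If the initial guess x_0 (v = 0, adiabatic pressure) already is, up to roundoff, the exact solution, then ||r_0|| is dominated by floating-point noise, the relative target is meaningless, and the "converged" velocities are essentially numerical noise -- consistent with the non-useful velocities described in observation (1).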

@tjhei
Member

tjhei commented Sep 12, 2014

interestingly enough, GMRES is unable to reduce the residual at all:
initial residual: 2.15188e+28
rhs norm: 6.20482e+28
target tolerance: 6.20482e+16
residual in it 1: 1.98576e+28
residual after 1000 iterations: 1.98407e+28

If I switch from entropy viscosity to first-order viscosity, everything works fine. I am not sure whether one of the old solution variables is wrong after mesh refinement, or whether we for some reason do not apply enough stabilization.
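For reference, the two stabilizations being compared here, written in the step-31-style form (ASPECT's exact constants and norms may differ). On a cell K with diameter h_K,

    nu_max(K) = beta * h_K * max_K |u|
    nu_E(K)   = c_R * h_K^2 * max_K |R_E(T)| / var(E)
    nu(K)     = min( nu_max(K), nu_E(K) ),

where R_E is the entropy residual evaluated from the old solutions and var(E) is the global entropy variation. First-order viscosity means always using nu_max, which depends only on the velocity magnitude; the entropy viscosity additionally depends on the old solutions and on var(E), exactly the quantities that look suspect right after mesh refinement.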

@tjhei
Member

tjhei commented Sep 12, 2014

The entropy variation jumps up by a factor of 2 in this step, and the extrapolated_advection_field_range for the temperature is (973, 2398.36) instead of the (973, 1973) it was in every step before.

@gassmoeller
Member

Hmm, the velocity is changing a lot between the old solution and the old old solution, and the maximum entropy stabilization and the advection system residual are calculated with the mean velocity of the two. However, the velocity is decreasing, so in theory we should only get overstabilization, not too little stabilization. With higher stabilization values for beta and c_R one can delay the crash to later time steps, but it still occurs.

@tjhei
Member

tjhei commented Sep 12, 2014

You know what the problem could be? old_old_solution is interpolated, so it might not be divergence free.

The increase of the max temperature in global_field_range is not the problem...
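For what it's worth, the property in question, assuming the usual mixed Stokes discretization: on the old mesh the discrete solution satisfies

    \int_Omega q_h (div u_h) dx = 0   for all discrete pressure test functions q_h,

but interpolating u_h onto the refined/coarsened mesh does not preserve this, so the transferred old_old velocity can carry a spurious discrete divergence that then enters the advection (and stabilization) terms of the temperature equation.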

@gassmoeller
Member

Ok, that jump in the entropy variation is not surprising if the extrapolated_advection_field_range changes, because we approximate the average temperature by taking the arithmetic mean of the maximal and minimal temperature, and (973+2398.36)/2 is much farther away from 1473 than (973+1973)/2 is. The interesting question is why the extrapolated_advection_field_range changes at all; that would require largely different old_solution and old_old_solution temperature values just because of the mesh refinement in the step before. By the way, do you know why we use an iterated trapezoidal quadrature rule for estimating the extrapolated_advection_field_range? I was just confused by the difference from the Gauss quadrature otherwise used in the assembly.
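A rough back-of-the-envelope, assuming the step-31-style entropy E(T) = (T - T_m)^2 with T_m taken as the midpoint of the extrapolated range (the exact definition in the code may differ):

    range (973, 1973):     T_m = 1473.0,   max |T - T_m| = 500.0,   max E = 500.0^2  ~ 2.5e5
    range (973, 2398.36):  T_m = 1685.68,  max |T - T_m| = 712.68,  max E = 712.68^2 ~ 5.1e5

So the scale of E, and with it the entropy variation, roughly doubles -- consistent with the factor-of-2 jump Timo reported -- and, if that variation sits in the denominator of the entropy viscosity as sketched above, the stabilization is correspondingly halved.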

@tjhei
Member

tjhei commented Sep 12, 2014

This is how the entropy viscosity looks when the solution fails:
[image: entropy viscosity field]

The trapezoidal rule evaluates the solution exactly at the support points instead of finding the min/max of the quadratic solution. Think of a function in 1d with values 0, 1, 1 at x = 0, 1, 2 and a quadratic interpolation of these values. The quadratic function will overshoot and have a much larger maximum. Why do we do this here? I am not sure...
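Working out that 1d example with standard quadratic Lagrange interpolation:

    p(x) = -x^2/2 + 3x/2,   so p(0) = 0, p(1) = 1, p(2) = 1,
    p'(x) = -x + 3/2 = 0 at x = 3/2,   p(3/2) = 1.125.

Evaluating only at the support points x = 0, 1, 2 (which is what the iterated trapezoidal rule does) returns the nodal range [0, 1], while the quadratic interpolant itself reaches 1.125; sampling at interior points such as Gauss points would pick up part of that overshoot.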

@gassmoeller
Member

I know you wrote further up that the time stepping scheme should be unconditionally stable, but I would like to come back to this point for a moment. As far as I can see, we calculate the entropy stabilization with the old_solution; in the assembly of the equation, however, we use the extrapolated current_linearization_point velocities. We know the velocities have changed a lot between old_old_solution and old_solution, so the current_linearization_point will be quite different from old_solution (even more so because the velocities are much lower now, and therefore the timestep length is now 3 times larger than in the timestep before). Additionally, if I use the old_solution to assemble the matrix for test purposes, the solver converges well. Maybe calculating the entropy viscosity with the old solution is not accurate enough in this particular case?
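A crude scalar illustration of that point (RMS values cannot literally be extrapolated like this, but it shows the scale), assuming the same extrapolation formula as sketched earlier in the thread and k_n ~ 3 k_{n-1}:

    u^* = (1 + k_n/k_{n-1}) u^{n-1} - (k_n/k_{n-1}) u^{n-2} ~ 4 u^{n-1} - 3 u^{n-2}
    4 * 0.00439 - 3 * 0.0434 ~ -0.11 m/year   (using the RMS velocities from Wolfgang's log above),

so the extrapolated linearization point can be far larger in magnitude than either old velocity, while the entropy viscosity is still sized for the old, much smaller velocities.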

@tjhei
Member

tjhei commented Sep 15, 2014

Rene, so if you replace the extrapolated Stokes solution by old_solution in the entropy computation, it works?

@ian-r-rose
Contributor

Has anybody attempted to do a git bisect for this problem?

On Sun, Sep 14, 2014 at 5:08 PM, Timo Heister notifications@github.com
wrote:

Rene, so if you replace the extrapolated Stokes solution by old_solution
in the entropy computation, it works?



@gassmoeller
Member

What I did so far was to replace the current_velocities in the matrix assembly in assembly.cc:1597 with the old velocities. This is clearly wrong (changing our time stepping just there but nowhere else), but stable. Currently we use the old_velocities for calculating the entropy viscosity, but we might test replacing them with the current velocities. I think we did not do this because it introduces some kind of nonlinearity (basing the assembly of the matrix on an extrapolated quantity)? It is, however, a bit more work to replace all the calls to scratch.explicit_material_model with the normal material_model, and I am currently a bit short on schedule, so I can only look into it in two weeks or so.

@ian-r-rose: I do not think anybody has done a bisect yet; however, I faintly remember trying to run it with aspect-1.1, and the problem was already there. I might be wrong though.

@ian-r-rose
Contributor

Not a proper bisection, but I get the same broken behavior as far back as the switch to cmake. I didn't go further back because I didn't want to switch up the build system.

@bangerth
Contributor Author

I recall that the issue predates the switch to git for sure. If it even predates the switch to cmake then it's indeed pretty old. I dimly recall having done a bisection at one point in the past and finding that the culprit was a completely innocent-looking patch -- it may have been the one where Timo fiddled with the way we compute the relative tolerance for convergence. But I really can't say for sure any more. I think finding the patch that caused this would really help nail down the problem.

@tjhei
Member

tjhei commented Sep 16, 2014

I went back all the way to 5385905 (May 2013), fought a bit to make it compile, and it fails in the same way. So I don't have a revision where my reduced code is working.

@bangerth
Contributor Author

Is that before or after the section in the manual was added? I know I ran it when I wrote the section...

@tjhei
Member

tjhei commented Sep 16, 2014

before. The file got added in Feb 2014:

$ git log --reverse -- cookbooks/shell_simple_3d.prm
commit 29c8681c39a2ea4216604892cc89470ca541be4f
Author: Wolfgang Bangerth <bangerth@math.tamu.edu>
Date:   Mon Feb 17 19:16:21 2014 +0000

    Annotate some parameter files. Add a couple of new ones. Move those that hav

What do you propose? Maybe it worked back then but only on a finer mesh?

@tjhei
Member

tjhei commented Sep 16, 2014

(I am now rerunning your original shell_3d from when you checked it in; we'll see where that goes.)

@tjhei
Member

tjhei commented Sep 16, 2014

Original shell_3d (b5241f3) crashes in step 17 (as W. wrote in the first message in this thread). So either it never worked with this resolution, or the changes were inside deal.II.

@tjhei
Member

tjhei commented Jan 19, 2015

Good news: it looks like it is working with the recent manifold changes (my file, 42 timesteps and running). We need to do some more testing, though. 👍

@bangerth
Contributor Author

Let's get a release out right now, quick, pronto, before anyone finds something that doesn't work! ;-)

@bangerth
Contributor Author

bangerth commented Feb 3, 2016

Confirmed to work now. Not fast, but that's because it's a big model I suppose.
