Outputting correct "total wall time" in log file after restart #4168

elodie-kendall · 2021-07-09T16:49:01Z

For users wanting to compare run speed for different models over a longer time period it would be useful to output the correct "total wall time" in the log file after a restart.

gassmoeller · 2021-07-09T19:22:09Z

I agree, that would be a nice feature, but I think some functionality is missing in deal.II to do this. We store all of the timing information in an object of type ComputingTimer, which is a deal.II class (see include/aspect/simulator.h:1879). Unfortunately, this class does not seem to support serialization using the Boost serialization mechanism that we use to store other information (like the current model time), so we cannot simply write it into a checkpoint like the other properties (see source/simulator/checkpoint_restart.cc:600).

There are also some subtle problems. For example, each MPI process stores its own timer, and the output you see on screen is an average (or maximum, I don't remember) over all MPI processes. But when checkpointing we would only want to store a single timer, not one per MPI process. Thus, after the checkpoint the timer would be different than if we had just continued the simulation. Maybe this does not matter and it is still worth doing, but it would require changes in the ComputingTimer class inside deal.II.

If you would like to do this we should check in with @bangerth or @tjhei, because I think there may be other subtle problems (like what happens with the currently open timing sections upon restart, are we still in that section?). Summarizing, it is a nice feature, but it will be more complicated to implement than you might think.

elodie-kendall · 2021-07-21T10:38:28Z

Thanks a lot for this @gassmoeller. I would still like to do this as my current wall time limit is 12 hours and I'm doing long run benchmarks against StagYY. I think other users would also appreciate knowing the total run time. I would be happy to chat with @bangerth and @tjhei when you have the time :)

tjhei · 2021-07-21T11:27:38Z

I agree that a generic solution as described by @gassmoeller is somewhat complicated and requires changes in deal.II. A simpler approach would be a single timer on rank 0 for the total wall time. We would store the number of seconds elapsed when checkpointing and, after a restart, print the sum of the stored seconds and the time elapsed in the current run.
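
The simpler approach could be sketched roughly as follows. This is only a minimal illustration using the C++ standard library: the class name `TotalWallTime` and its `save`/`load` functions are hypothetical and not part of ASPECT or deal.II, and the plain text stream stands in for whatever the real checkpoint mechanism would use. It assumes that only rank 0 owns and writes the timer.

```cpp
#include <cassert>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <sstream>

// Hypothetical helper (not an ASPECT or deal.II class): accumulates wall
// time across restarts by storing a single number of seconds.
class TotalWallTime
{
public:
  using Clock = std::chrono::steady_clock;

  TotalWallTime()
    : previous_runs_seconds(0.0), run_start(Clock::now())
  {}

  // Wall time spent in the current run only (since construction or load()).
  double current_run_seconds() const
  {
    return std::chrono::duration<double>(Clock::now() - run_start).count();
  }

  // Wall time of all earlier runs plus the current one.
  double total_seconds() const
  {
    return previous_runs_seconds + current_run_seconds();
  }

  // Called when writing a checkpoint: store the accumulated total.
  void save(std::ostream &out) const
  {
    out << std::setprecision(17) << total_seconds() << '\n';
  }

  // Called on restart: read the stored total and restart the clock.
  void load(std::istream &in)
  {
    in >> previous_runs_seconds;
    assert(!in.fail());
    run_start = Clock::now();
  }

private:
  double            previous_runs_seconds;
  Clock::time_point run_start;
};
```

With something like this, a restarted run would call `load()` once after reading the checkpoint and report `total_seconds()` instead of the time of the current run alone. Per-section timings would still restart from zero, which is the part that needs the changes in deal.II discussed above.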

bangerth · 2021-07-21T12:03:52Z

Step 1 would indeed be to implement the missing serialization of the TimerOutput class in deal.II.

We currently only save information from process 0 and, on restart, distribute that to all other processes. If that means that we lose a bit of information because the TimerOutput objects are different on different processes, I think that's acceptable.

elodie-kendall · 2021-08-23T10:22:57Z

Thanks! So you think that the simpler approach @tjhei suggested would not work, @bangerth?

tjhei · 2021-10-28T21:14:42Z

Fixed by #4332.

bangerth mentioned this issue Jul 21, 2021

Improve restart start-up #4269

Closed

tjhei closed this as completed Oct 28, 2021
