Outputting correct "total wall time" in log file after restart #4168

Closed
elodie-kendall opened this issue Jul 9, 2021 · 6 comments

Comments

@elodie-kendall
Contributor

For users who want to compare run speed for different models over a longer time period, it would be useful to output the correct "total wall time" in the log file after a restart.

@gassmoeller
Member

I agree, that would be a nice feature, but I think some functionality is missing in deal.II to do this. We store all of the timing information in an object of type ComputingTimer, which is a deal.II class (see include/aspect/simulator.h:1879). Unfortunately, this class does not seem to support serialization using the boost serialization mechanism that we use to store other information (like the current model time), so we cannot simply write it into a checkpoint like the other properties (see source/simulator/checkpoint_restart.cc:600).

There are also some subtle problems. For example, each MPI process stores its own timer, and the output you see on screen is an average (or maximum, I don't remember) over all MPI processes. But when checkpointing we would only want to store a single timer, not one per MPI process. Thus, after the checkpoint, the timer would be different than if we had just continued the simulation. Maybe this does not matter and it is still worth doing, but it would require changes in the ComputingTimer class inside deal.II.

If you would like to do this we should check in with @bangerth or @tjhei, because I think there may be other subtle problems (like what happens with the currently open timing sections upon restart, are we still in that section?). Summarizing, it is a nice feature, but it will be more complicated to implement than you might think.

@elodie-kendall
Contributor Author

Thanks a lot for this @gassmoeller. I would still like to do this as my current wall time limit is 12 hours and I'm doing long run benchmarks against StagYY. I think other users would also appreciate knowing the total run time. I would be happy to chat with @bangerth and @tjhei when you have the time :)

@tjhei
Member

tjhei commented Jul 21, 2021

I agree that a generic solution as described by @gassmoeller is somewhat complicated and requires changes in deal.II. A simpler approach would be a single timer on rank 0 for the total wall time. We would store the number of seconds elapsed when checkpointing and print the sum of them.
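The simpler approach described above could look roughly like the following sketch. It is an illustration only, not the code that was eventually merged: the class `TotalWallTime` and its member names are hypothetical and exist neither in ASPECT nor in deal.II, and a plain text stream stands in for the real checkpoint file. The idea is that rank 0 keeps one wall clock, writes the accumulated seconds at each checkpoint, and adds the stored value back after a restart.

```cpp
#include <cassert>
#include <chrono>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not part of ASPECT or deal.II): accumulates wall
// time across restarts. Intended to live on MPI rank 0 only.
class TotalWallTime
{
public:
  using Clock = std::chrono::steady_clock;

  // 'seconds_before_restart' is the value read back from the checkpoint,
  // or 0 for a fresh start.
  explicit TotalWallTime(const double seconds_before_restart = 0.0)
    : previous_seconds(seconds_before_restart),
      run_start(Clock::now())
  {
    assert(seconds_before_restart >= 0.0);
  }

  // Seconds elapsed since this run (or restart) began.
  double current_run_seconds() const
  {
    return std::chrono::duration<double>(Clock::now() - run_start).count();
  }

  // Seconds accumulated over all runs, including those before the restart.
  double total_seconds() const
  {
    return previous_seconds + current_run_seconds();
  }

  // Write the accumulated total into a (stand-in) checkpoint stream.
  void save(std::ostream &out) const
  {
    out << std::setprecision(17) << total_seconds() << '\n';
  }

  // Recreate the timer from a (stand-in) checkpoint stream.
  static TotalWallTime load(std::istream &in)
  {
    double seconds = 0.0;
    in >> seconds;
    return TotalWallTime(seconds);
  }

private:
  double            previous_seconds;
  Clock::time_point run_start;
};
```

With this sketch, a job that hit a 12 hour limit and was restarted would construct `TotalWallTime(43200.0)` (or call `load` on the checkpoint stream), and `total_seconds()` would then report the sum of both runs rather than only the time since the restart.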

@bangerth
Contributor

Step 1 would indeed be to implement the missing serialization of the TimerOutput class in deal.II.

We currently only save information from process 0 and, on restart, distribute that to all other processes. If that means that we lose a bit of information because the TimerOutput objects are different on different processes, I think that's acceptable.
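To make the idea of serializing per-section timings concrete, here is a minimal sketch under loud assumptions: it does not use or reproduce the deal.II `TimerOutput` interface or Boost serialization. The type `SectionTimes` and the functions `save_section_times` and `load_section_times` are hypothetical, and a text stream stands in for the checkpoint written by process 0. Distributing the restored values to the other processes is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for the timing data: section name -> wall seconds.
using SectionTimes = std::map<std::string, double>;

// Write the number of sections, then one "<seconds> <name>" line each.
// Section names may contain spaces, so the name goes last on the line.
void save_section_times(const SectionTimes &times, std::ostream &out)
{
  out << times.size() << '\n';
  out << std::setprecision(17);
  for (const auto &entry : times)
    out << entry.second << ' ' << entry.first << '\n';
}

// Read back what save_section_times() wrote.
SectionTimes load_section_times(std::istream &in)
{
  SectionTimes times;
  std::size_t  n_sections = 0;
  in >> n_sections;
  for (std::size_t i = 0; i < n_sections; ++i)
    {
      double seconds = 0.0;
      in >> seconds;
      in.get(); // skip the single space separating seconds and name
      std::string name;
      std::getline(in, name);
      assert(seconds >= 0.0);
      times[name] = seconds;
    }
  return times;
}
```

Only one copy of the table is stored here, which matches the trade-off mentioned above: whatever differed between processes before the checkpoint is lost on restart.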

@elodie-kendall
Contributor Author

Thanks! So you think that the simpler approach @tjhei suggested would not work, @bangerth?

@tjhei
Member

tjhei commented Oct 28, 2021

fixed by #4332

@tjhei tjhei closed this as completed Oct 28, 2021