Outputting correct "total wall time" in log file after restart #4168
Comments
I agree, that would be a nice feature, but I think some functionality needed for this is missing in deal.II. We store all of the timing information in an object of type There are also some subtle problems. For example, each MPI process stores its own timer, and the output you see on screen is an average (or maximum, I don't remember) over all MPI processes. But when checkpointing we would only want to store a single timer, not one per MPI process. Thus after the checkpoint the timer would be different than if we had just continued the simulation. Maybe this does not matter and it is still worth doing, but it would require changes in the ComputingTimer class inside deal.II. If you would like to do this we should check in with @bangerth or @tjhei, because I think there may be other subtle problems (for example, what happens to the timing sections that are open at the time of the restart: are we still in those sections?). In summary, it is a nice feature, but it will be more complicated to implement than you might think.
Thanks a lot for this @gassmoeller. I would still like to do this as my current wall time limit is 12 hours and I'm doing long run benchmarks against StagYY. I think other users would also appreciate knowing the total run time. I would be happy to chat with @bangerth and @tjhei when you have the time :)
I agree that a generic solution as described by @gassmoeller is somewhat complicated and requires changes in deal.II. A simpler approach would be a single timer on rank 0 for the total wall time. We would store the number of seconds elapsed so far whenever we write a checkpoint and, after a restart, print the sum of the stored seconds and the time elapsed since the restart.
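The simpler approach could look roughly like the sketch below. This is only an illustration under assumptions: `TotalWallTime` and its `save()`/`load()` members are hypothetical names, not part of ASPECT or deal.II, and the checkpoint payload is a plain string rather than whatever serialization the checkpoint code actually uses. The idea is that rank 0 keeps the seconds accumulated before the last restart plus a start time for the current run segment.

```cpp
#include <cassert>
#include <chrono>
#include <sstream>
#include <string>

// Hypothetical helper (not an ASPECT or deal.II class): accumulates the
// total wall time of a simulation across checkpoint/restart cycles.
// Intended to live on MPI rank 0 only.
class TotalWallTime
{
public:
  using clock = std::chrono::steady_clock;

  // A fresh run starts its first segment now, with nothing accumulated.
  TotalWallTime() : segment_start(clock::now()) {}

  // Seconds elapsed since the program started or was last restarted.
  double current_segment_seconds() const
  {
    return std::chrono::duration<double>(clock::now() - segment_start).count();
  }

  // Seconds from all earlier segments plus the current one.
  double total_seconds() const
  {
    return previous_seconds + current_segment_seconds();
  }

  // Value to write into the checkpoint: the total elapsed so far.
  std::string save() const
  {
    std::ostringstream out;
    out.precision(17); // enough digits to round-trip a double
    out << total_seconds();
    return out.str();
  }

  // Restore the accumulated seconds from a checkpoint and begin a new
  // segment, so that time spent waiting in the queue is not counted.
  void load(const std::string &data)
  {
    std::istringstream in(data);
    double restored = 0.0;
    in >> restored;
    assert(!in.fail() && restored >= 0.0);
    previous_seconds = restored;
    segment_start    = clock::now();
  }

private:
  double            previous_seconds = 0.0;
  clock::time_point segment_start;
};
```

In this sketch, a job that hits a 12 hour limit would write roughly 43200 seconds into its checkpoint via `save()`, and the restarted job would call `load()` and then report `total_seconds()` in the log. It deliberately ignores the per-section timers and the per-process differences mentioned above.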
Step 1 would indeed be to implement the missing serialization of the We currently only save information from process 0 and, on restart, distribute that to all other processes. If that means that we lose a bit of information because the
Fixed by #4332.
For users wanting to compare run speed for different models over a longer time period it would be useful to output the correct "total wall time" in the log file after a restart.