When using DMTCP, you will need two submit scripts: one for launching under checkpoint control, and one for restarting after a job crashes.

For example, for SLURM, edit slurm_launch.job and replace "<your_binary>" in the dmtcp_launch line with your own binary. The default script does not checkpoint automatically. Search for "dmtcp_command" in the script for instructions on how to use it to request a checkpoint manually, and run "dmtcp_command -h" to see its options. Alternatively, search for "start_coordinator" in slurm_launch.job and add "-i 3600" to create a checkpoint every 3600 seconds (every hour). "dmtcp_coordinator -h" and "dmtcp_launch -h" are also available.

When ready, execute the SLURM command:

  sbatch slurm_launch.job

Upon checkpointing, a script, dmtcp_restart_script.sh, is saved in the local directory along with the checkpoint image files.

When restarting, slurm_rstr.job assumes that dmtcp_restart_script.sh is in the local directory. By default, the restart script is set up for manually requested checkpointing. See the instructions above and "dmtcp_restart -h" for setting checkpoints at regular time intervals, and modify slurm_rstr.job if automatic checkpointing is desired. Finally, it suffices to run:

  sbatch slurm_rstr.job
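The choice between the two submit scripts can be sketched as a small shell helper. This is only an illustration, not part of DMTCP: the function name choose_job is hypothetical, and it assumes that slurm_launch.job, slurm_rstr.job, and any dmtcp_restart_script.sh all live in the same directory. It prints the name of the job script to submit and does not call sbatch itself.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with DMTCP): print which job script
# to submit, based on whether a checkpoint has already been written.
# Assumption: the job scripts and dmtcp_restart_script.sh share one directory.
choose_job() {
    local dir="${1:-.}"   # directory to inspect; defaults to the current one
    if [ -f "$dir/dmtcp_restart_script.sh" ]; then
        # A restart script exists, so a checkpoint was taken: restart from it.
        echo "slurm_rstr.job"
    else
        # No checkpoint yet: launch the job under checkpoint control.
        echo "slurm_launch.job"
    fi
}
```

After sourcing the snippet on a machine with SLURM, something like sbatch "$(choose_job .)" would submit the matching script. Note that the presence of dmtcp_restart_script.sh says nothing about whether the earlier job actually crashed, so check the job status before resubmitting.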