Slurm DMTCP


Intro

This document presents a complete description of the integration of Slurm and DMTCP, including installation, configuration and testing.

The Slurm version used in this work is 17.11.0-0pre2.

Functionality

The basic idea behind this plugin is to provide checkpoint/restart support for every batch job running on a cluster. This means that once it is activated (by configuring it in Slurm, as detailed later), every job submitted with sbatch runs with DMTCP support by default.

This plugin provides two flags to Slurm (a short usage example follows this list):

  • configure: --with-dmtcp=<path> . This flag indicates where Slurm should search for the DMTCP installation.
  • sbatch: --no-dmtcp . This flag indicates that the job will NOT have DMTCP support. Note that in the case of preemption such a job will be cancelled, without the possibility of being restarted.
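
For example (a minimal sketch; the DMTCP path and the job script name myJob.sh are just placeholders):

./configure --with-dmtcp=/home/localsoft/dmtcp   # build Slurm against that DMTCP installation
sbatch --no-dmtcp myJob.sh                       # submit a job WITHOUT DMTCP support
sbatch myJob.sh                                  # DMTCP support is enabled by default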

Required material

Hardware

Right now DMTCP does NOT support Omnipath, although it will probably be supported by the end of the year. All these experiments were performed on an Infiniband cluster.

OS

The plugin has been successfully tested with CentOS 7. Many other distributions will probably work correctly as well.

DMTCP does not support checkpoint/restart on heterogeneous systems. Make sure that the libraries are the same (including version) across your whole cluster. In particular, double-check the Infiniband drivers:

ls -l /usr/lib64/libmlx*
ls -l /usr/lib64/libibverbs.so*
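
A quick way to spot mismatches is to compare the checksums of these libraries on every node; a minimal sketch, assuming password-less ssh and the (hypothetical) node names acme12 and acme32:

for node in acme12 acme32; do
    echo "== $node =="
    ssh $node 'md5sum /usr/lib64/libmlx* /usr/lib64/libibverbs.so*'
done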

Software stack

This is the list of software known to support task migration. Note that it points to specific branches of each application. It will be kept updated, so the content here is (hopefully) always correct.

Please note that the following information only applies to this project; it is NOT a full guide on DMTCP/Slurm/MVAPICH installation and configuration. In particular, configuring Slurm for the first time is not a straightforward process, so do not hesitate to contact us if you run into issues.

DMTCP

Download link: https://github.com/jiajuncao/dmtcp/tree/ib-id-full-virtualization .

Compiled with

./configure --prefix=/home/localsoft/dmtcp --enable-infiniband-support
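
Putting together the download and the build (a minimal sketch; the branch name comes from the link above and the prefix is the one used throughout this document):

git clone -b ib-id-full-virtualization https://github.com/jiajuncao/dmtcp.git
cd dmtcp
./configure --prefix=/home/localsoft/dmtcp --enable-infiniband-support
make && make install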

Slurm

Download link: https://github.com/supermanue/slurm/tree/DMTCP .

Note: you will get the correct branch if you download the .zip file from the previous link. If you clone the project, you will automatically get the default branch (this is standard git behaviour, not our decision). In that case you have to switch to the DMTCP branch with your favourite git command; see the sketch below.
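
For instance, with standard git commands:

git clone https://github.com/supermanue/slurm.git
cd slurm
git checkout DMTCP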

Steps for Slurm installation (summarized in the sketch after this list):

  • run "sh autogen.sh" on the root folder. This creates the necessary files for the compilation.

  • run "./configure.sh" with "--with-dmtcp" flag, plus any other flag you want to use. If dmtcp executables are not in your path, then compile with "--with-dmtcp=path".

You should see something like “checking for dmtcp installation... /dmtcp/”. If instead you see “configure: WARNING: unable to locate dmtcp installation”, DMTCP has not been found.

  • "make && make install" as usual

MPI

Mvapich tested version: mvapich2-2.2, downloaded from http://mvapich.cse.ohio-state.edu/downloads/.

Compiled manually with

MVAPICH2 configure:   	--prefix=/home/localsoft/mvapich2 --disable-mcast --with-slurm=/home/localsoft/slurm --with-pmi=slurm --with-pm=none -enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo --enable-mpit-pvars=all --enable-check-compiler-flags --enable-threads=multiple --enable-weak-symbols --enable-fast-install --enable-g=dbg --enable-error-messages=all --enable-error-checking=all

OpenMPI tested version: OpenMPI 2.0.3, downloaded from https://www.open-mpi.org/software/ompi/v2.0/
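
For OpenMPI, a configuration along these lines should work (a minimal sketch, not the exact command line used in this work; the install prefix is a placeholder, and --with-slurm / --with-pmi are the standard OpenMPI configure flags for Slurm integration):

./configure --prefix=/home/localsoft/openmpi --with-slurm --with-pmi=/home/localsoft/slurm
make && make install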

Note that, in principle, any MPI library should work. If it does not, please contact Jiajun Cao (mail at the end of this document) or Dr. Cooperman.

Slurm Configuration

in slurm.conf

CheckpointType=checkpoint/dmtcp   <— this is the only difference from any other checkpoint plugin
JobCheckpointDir=/home/localsoft/slurm/checkpoint  (or wherever you want)

in plugstack.conf

optional          /home/localsoft/slurm/lib/slurm/checkpoint_dmtcp.so

Creating the checkpoint directory is required with any checkpoint plugin, but just as a reminder:

mkdir /home/localsoft/slurm/checkpoint/
chmod 777  /home/localsoft/slurm/checkpoint/
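
After restarting slurmctld and slurmd with the new configuration, the values can be double-checked with a standard Slurm command:

scontrol show config | grep -i -e CheckpointType -e JobCheckpointDir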

Recommended TMPDIR configuration

In order to ensure that the restart operation will succeed, this plugin saves all the checkpoint information in a TMPDIR owned by the job owner. This can represent a significant overhead.

In order to minimize this overhead, our suggestion is to have a dedicated TMPDIR for each job. This can be done with prolog and epilog scripts, as follows:

more prolog.sh

#!/bin/sh
tmpFolder="/SCRATCH_LOCAL/$SLURM_JOB_USER/$SLURM_JOB_ID"
mkdir -p $tmpFolder
echo "export TMPDIR=$tmpFolder"


more epilog.sh

#!/bin/sh
tmpFolder="/SCRATCH_LOCAL/$SLURM_JOB_USER/$SLURM_JOB_ID"
rm -rf $tmpFolder

and in slurm.conf,

TaskProlog=/home/localsoft/slurm/scripts/prolog.sh  #put path to your script here
Epilog=/home/localsoft/slurm/scripts/epilog.sh
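
To check that the prolog is being applied, a trivial job that prints TMPDIR can be submitted (a minimal sketch; checkTmpdir.sh is a hypothetical name):

-bash-4.2$ more checkTmpdir.sh
#!/bin/sh
echo "TMPDIR=$TMPDIR"

sbatch checkTmpdir.sh   # the job output should show a per-job folder under /SCRATCH_LOCAL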

TEST FILES

These are the files required to test all the functionalities.

serial

File

-bash-4.2$ more helloWorld.c
/* C Example */
#include <stdio.h>
#include <unistd.h> /* for sleep() */

int main (int argc, char *argv[])
{
  int i;

  printf( "Hello world\n");

  for (i = 0; i < 10; i++){
	sleep(1);
	printf("%d\n",i);
	}

  printf( "Goodbye world\n");

  return 0;
}

submission script

-bash-4.2$ more helloWorld.sh
#!/bin/sh

echo "INITDATE=`date +%s`"

./helloWorld

echo "ENDDATE=`date +%s`"

submission script LONG

-bash-4.2$ more helloWorldLong.sh
#!/bin/sh
echo "INITDATE=`date +%s`"

sleep 300
echo "ENDDATE=`date +%s`"

MPI

File

/* C Example */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for sleep() */
#include <sys/time.h> /* for gettimeofday() */
#include <mpi.h>

long long get_time_ms (void)
{
    struct timeval te;
    gettimeofday(&te, NULL); // get current time
    long long milliseconds = te.tv_sec*1000LL + te.tv_usec/1000; // calculate milliseconds
    return milliseconds;
}

int main (int argc, char *argv[])
{
  int rank, size, namelen, i;

  long long initTime, endTime, exectime;

  initTime = get_time_ms();

  MPI_Init (&argc, &argv);  /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);    /* get number of processes */

  char   processor_name[MPI_MAX_PROCESSOR_NAME];
  MPI_Get_processor_name(processor_name, &namelen);

  printf( "Hello world from process %d of %d on %s\n", rank, size, processor_name );
  printf( "In process %d, SLURM_LOCALID=%s\n", rank, getenv("SLURM_LOCALID") );

  for (i = 0; i < 20; i++){
	sleep(1);
	printf("%d on process %d of %d\n", i, rank, size );
	}

  printf( "In process %d END, SLURM_LOCALID=%s\n", rank, getenv("SLURM_LOCALID") );
  printf( "Goodbye world from process %d of %d\n", rank, size );

  endTime = get_time_ms();
  exectime = endTime - initTime;
  printf("exectime:  %lld\n", exectime);

  MPI_Finalize();

  return 0;
}

submission script


#!/bin/sh

echo "INITDATE=`date +%s`"

srun ./helloWorldMPI

echo "ENDDATE=`date +%s`"

OpenMP

File

#include <stdio.h>
#include <unistd.h> /* for sleep() */
#include <omp.h>    /* for omp_get_thread_num(), omp_get_num_threads() */

int main (int argc, char *argv[]) {
 int rank, size, i;

 //compiled with gcc -fopenmp helloWorldOpenMP.c -o helloWorldOpenMP
 #pragma omp parallel private(rank, size, i)
 {
  rank = omp_get_thread_num();
  size = omp_get_num_threads();
  printf("Hello World from thread = %d\n", rank);

  if (rank == 0)
  {
   printf("Number of threads = %d\n", size);
  }

  for (i = 0; i < 20; i++){
	sleep(1);
	printf("%d on thread %d of %d\n", i, rank, size);
	}
  printf( "Goodbye world from thread %d of %d\n", rank, size );
 }
 return 0;
}

Submission script

#!/bin/sh
export OMP_NUM_THREADS=2

srun ./helloWorldOpenMP

Hybrid MPI+OpenMP

File

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>

int main (int argc, char** argv){
    int rank, size;
    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    int nthreads, tid;
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        printf("Hello world from OpenMP thread %d/%d on MPI process %d/%d. I am running on core %d of node %s.\n", tid,nthreads,rank,size,sched_getcpu(),getenv("SLURM_NODEID"));

        int i = 0;
        for (i = 0; i < 20; i++){
      	   sleep(1);
      	    printf("%d thread %d/%d on MPI process %d/%d \n",i, tid,nthreads,rank,size);
      	   }
        printf("Goodbye world from OpenMP thread %d/%d on MPI process %d/%d. \n", tid,nthreads,rank,size);


    }
    MPI_Finalize();
    return 0;
}


Submission script

#!/bin/sh
export OMP_NUM_THREADS=2

srun ./helloWorldMPIOpenMP

Tests

This is a full test of the plugin. It can be employed to test the different functionalities and make sure that everything works as expected.

Serial

1.- No DMTCP

This executes the job with no checkpoint support.

sbatch --no-dmtcp helloWorld.sh

2.- DMTCP. Just init

This executes the job with checkpoint support. As you can see, the support is enabled by default.

sbatch helloWorld.sh

3.- DMTCP. Checkpoint+Resume

This creates a checkpoint and the execution continues.

sbatch helloWorld.sh
scontrol checkpoint create <job_id>

4.- DMTCP. Checkpoint + Restart. Same node.

This creates a checkpoint, cancels the execution (vacate) and then restarts the job (restart). Please do not submit any more jobs to the system, so you can be sure that the job runs on the same node before and after the restart.

sbatch helloWorld.sh
scontrol checkpoint vacate <job_id>
scontrol checkpoint restart  <job_id>

5.- DMTCP. Checkpoint + Restart. Different node.

This experiment is just like the previous one, but the job runs on a different node before and after the checkpoint/restart operation. Here, a "helloWorldLong" job is submitted to make sure that the nodes just vacated are unavailable for the restart, thus forcing our job to restart on a different one.

sbatch helloWorld.sh
scontrol checkpoint vacate <job_id>
sbatch --exclusive helloWorldLong.sh
scontrol checkpoint restart  <job_id>

MPI

1.- No DMTCP. Same node

sbatch -n 2 --tasks-per-node=2 --no-dmtcp helloWorldMPI.sh

2.- No DMTCP. Multiple nodes

sbatch -n 2 --tasks-per-node=1 --no-dmtcp helloWorldMPI.sh

3.- DMTCP. Just init. Same node

sbatch -n 2 --tasks-per-node=2 helloWorldMPI.sh

4.- DMTCP. Just init. Multiple nodes

sbatch -n 2 --tasks-per-node=1 helloWorldMPI.sh

5.- DMTCP. Checkpoint+Resume. Single node.

sbatch -n 2 --tasks-per-node=2 helloWorldMPI.sh
scontrol checkpoint create <job_id>

6.- DMTCP. Checkpoint+Resume. Multiple nodes.

sbatch -n 2 --tasks-per-node=1 helloWorldMPI.sh
scontrol checkpoint create <job_id>

7.- DMTCP. Checkpoint + Restart. Single node. Restart on the same node

sbatch -n 2 --tasks-per-node=2 helloWorldMPI.sh
scontrol checkpoint vacate <job_id>
scontrol checkpoint restart  <job_id>

8.- DMTCP. Checkpoint + Restart. Multiple nodes. Restart on the same nodes

sbatch -n 2 --tasks-per-node=1 helloWorldMPI.sh
scontrol checkpoint vacate <job_id>
scontrol checkpoint restart  <job_id>

9.- DMTCP. Checkpoint + Restart. Single node. Restart on a different node.

sbatch -n 2 --tasks-per-node=2 helloWorldMPI.sh
scontrol checkpoint vacate <job_id>
sbatch --exclusive helloWorldLong.sh
scontrol checkpoint restart  <job_id>

TODO 10.- DMTCP. Checkpoint + Restart. Multiple nodes. Restart on a different node.

This should work but right now I do not have a cluster with enough nodes to test it...

OpenMP

1.- No DMTCP

This executes the job with no checkpoint support.

sbatch --no-dmtcp helloWorldOpenMP.sh

2.- DMTCP. Just init

This executes the job with checkpoint support. As you can see, the support is enabled by default.

sbatch helloWorldOpenMP.sh

3.- DMTCP. Checkpoint+Resume

This creates a checkpoint and the execution continues.

sbatch helloWorldOpenMP.sh
scontrol checkpoint create <job_id>

4.- DMTCP. Checkpoint + Restart. Same node.

This creates a checkpoint, cancels the execution (vacate) and then restarts the job (restart). Please do not submit any more jobs to the system, so you can be sure that the job runs on the same node before and after the restart.

sbatch helloWorldOpenMP.sh
scontrol checkpoint vacate <job_id>
scontrol checkpoint restart  <job_id>

5.- DMTCP. Checkpoint + Restart. Different node.

This experiment is just like the previous one, but the job runs on a different node before and after the checkpoint/restart operation. Here, a "helloWorldLong" job is submitted to make sure that the nodes just vacated are unavailable for the restart, thus forcing our job to restart on a different one.

sbatch helloWorldOpenMP.sh
scontrol checkpoint vacate <job_id>
sbatch --exclusive helloWorldLong.sh
scontrol checkpoint restart  <job_id>

Hybrid MPI + OpenMP

1.- No DMTCP. Same node

sbatch -n 2 --tasks-per-node=2 --no-dmtcp helloWorldMPIOpenMP.sh

2.- No DMTCP. Multiple nodes

sbatch -n 2 --tasks-per-node=1 --no-dmtcp helloWorldMPIOpenMP.sh

3.- DMTCP. Just init. Same node

sbatch -n 2 --tasks-per-node=2 helloWorldMPIOpenMP.sh

4.- DMTCP. Just init. Multiple nodes

sbatch -n 2 --tasks-per-node=1 helloWorldMPIOpenMP.sh

5.- DMTCP. Checkpoint+Resume. Single node.

sbatch -n 2 --tasks-per-node=2 helloWorldMPIOpenMP.sh
scontrol checkpoint create <job_id>

6.- DMTCP. Checkpoint+Resume. Multiple nodes.

sbatch -n 2 --tasks-per-node=1 helloWorldMPIOpenMP.sh
scontrol checkpoint create <job_id>

7.- DMTCP. Checkpoint + Restart. Single node. Restart on the same node

sbatch -n 2 --tasks-per-node=2 helloWorldMPIOpenMP.sh
scontrol checkpoint vacate <job_id>
scontrol checkpoint restart  <job_id>

8.- DMTCP. Checkpoint + Restart. Multiple nodes. Restart on the same nodes

sbatch -n 2 --tasks-per-node=1 helloWorldMPIOpenMP.sh
scontrol checkpoint vacate <job_id>
scontrol checkpoint restart  <job_id>

9.- DMTCP. Checkpoint + Restart. Single node. Restart on a different node.

sbatch -n 2 --tasks-per-node=2 helloWorldMPIOpenMP.sh
scontrol checkpoint vacate <job_id>
sbatch --exclusive helloWorldLong.sh
scontrol checkpoint restart  <job_id>

TODO 10.- DMTCP. Checkpoint + Restart. Multiple nodes. Restart on a different node.

This should work but right now I do not have a cluster with enough nodes to test it...

Preemption

1.- Slurm Preemption mechanism.

Detailed information on preemption can be found at https://slurm.schedmd.com/preempt.html . The following are just some notes on our tests.

Slurm configuration to support preemption

PreemptMode=REQUEUE # requeue means "checkpoint and put it back into the queue to restart whenever"
PreemptType=preempt/partition_prio

PartitionName=normal Nodes=acme[12,32] default=YES MaxTime=23:59:00 State=UP  PriorityTier=10
PartitionName=highPriority  Nodes=acme[12,32] default=no MaxTime=23:59:00 State=UP PriorityTier=100

Test:

sbatch -n 2 --tasks-per-node=1 --exclusive helloWorldMPI.sh  # The important thing here is to keep DMTCP support enabled (do NOT pass "--no-dmtcp") and to make the job big enough to fill the whole cluster, so there are no free nodes where the higher-priority job could run

sbatch -p highPriority helloWorldLong.sh
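
While both jobs are in the system, their states can be followed with standard Slurm commands; the low-priority MPI job should be checkpointed and sent back to PENDING while the high-priority job runs:

squeue -o "%.10i %.12P %.10T %.20R"
scontrol show job <job_id>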

Troubleshooting and open problems

scontrol checkpoint error: Duplicate job id

Short answer:

This is normal. You just have to wait a little bit before restarting.

Long answer:

You are now facing a design decision in Slurm that I don't particularly like, although it probably makes sense.

Every Slurm job has an ID, JOBID. When anything happens to that job (it starts, it ends...), the status change is saved into a cache; the idea is not to overload the database. Then, every PURGE_JOB_INTERVAL, Slurm checks the status of all jobs and updates the database.

When you try to restart a job, Slurm checks the database but not the cache. If the purge has not happened yet, the information in the database says "job is running", so Slurm complains. After the purge, the database says "job is not running anymore" and the job can be restarted.

This parameter, PURGE_JOB_INTERVAL, cannot be set from the outside and must be modified within the code, in src/slurmctld/slurmctld.h (this is ugly, I know). The default value is 300 seconds, so jobs will have to wait 150 seconds on average before they can be restarted. In small clusters you can reduce the value: I am using 60 seconds and the overhead is negligible; it could probably be set to 5 seconds or so without even noticing.
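
For reference, this is roughly how the change looks from the Slurm source tree (a minimal sketch; the exact line may differ between versions):

grep -n PURGE_JOB_INTERVAL src/slurmctld/slurmctld.h   # locate the definition
# edit the value (e.g. from 300 to 60 seconds), then rebuild and restart slurmctld:
make && make install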

Limits to preemption

Preemption in Slurm is an open discussion, and any feedback is welcome.

Main problems:

  • when preemption is configured, it is applied whenever needed. This means that all jobs will be checkpointed if Slurm decides so. The problem is that if a job was submitted WITH "--no-dmtcp", it just gets cancelled and will not restart.

  • DMTCP cannot checkpoint every job. In particular,

    • those using GPUs cannot be saved/restored.
    • the ones run with "srun" should not be preempted
    • same with "salloc"
  • The behavior of the sbatch flag "--no-requeue" is not what I expected: the job is just cancelled instead of checkpointed and restarted. Is this the behavior we want?

People

  • Slurm is developed by SchedMD. The best way of solving any issue with Slurm is probably the developers mailing list, which right now acts as the meeting point of the user community.

  • DMTCP is developed by the High Performance Computing Lab of Northeastern University. The head of the group is Dr. Gene Cooperman (mail: gene "at" ccs dot neu dot edu ). The infiniband driver and most of the development required for this project has been carried out by Jiajun Cao (mail: jiajun "at" ccs dot neu dot edu ). Please contact both of them in the case of an issue with DMTCP.

  • the integration of both tools via a Slurm plugin has been carried out by Manuel Rodríguez Pascual (mail: manuel dot rodriguez dot pascual "at" gmail dot com) with great support from DMTCP team. Contact him for anything related to this.

  • Ulf Markwardt and Maik Schmidt from Technische Universität Dresden have collaborated in the debug process, made many useful suggestions and kindly provided access to their test cluster, which has been a key resource on this development.
