forked from LLNL/scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

Scalable Checkpoint / Restart (SCR) Library

The Scalable Checkpoint / Restart (SCR) library enables MPI applications to use distributed storage on Linux clusters to attain high file I/O bandwidth when checkpointing and restarting large-scale jobs. With SCR, jobs run more efficiently, recompute less work after a failure, and place less load on critical shared resources such as the parallel file system.

Detailed usage is provided at SCR.ReadTheDocs.io.

Quickstart

SCR uses the CMake build system. We recommend an out-of-source build:

./bootstrap.sh
mkdir build install
cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ..
make
make install
make test

Some useful CMake command line options:

  • -DCMAKE_INSTALL_PREFIX=[path]: Place to install the SCR library
  • -DCMAKE_BUILD_TYPE=[Debug/Release]: Build with debugging or optimizations
  • -DBUILD_PDSH=[OFF/ON]: CMake can automatically download and build the PDSH dependency
  • -DWITH_PDSH_PREFIX=[path to PDSH]: Path to an existing PDSH installation (should not be used with BUILD_PDSH)
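  • -DWITH_DTCMP_PREFIX=[path to DTCMP]: Path to an existing DTCMP installation (optional dependency)
  • -DWITH_YOGRT_PREFIX=[path to YOGRT]: Path to an existing libYOGRT installation (optional dependency)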
  • -DSCR_ASYNC_API=[CRAY_DW/INTEL_CPPR/IBM_BBAPI/NONE]: Asynchronous API to build against
  • -DSCR_RESOURCE_MANAGER=[SLURM/APRUN/PMIX/LSF/NONE]: Resource manager to build for
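
As an illustration of how these options combine, the sketch below assembles one possible configure command. The install location and the chosen values (Release, SLURM, NONE) are assumptions picked for the example, not recommendations. The command is printed rather than executed so it can be reviewed first.

```shell
# Hypothetical install location; adjust for your system.
prefix="$HOME/scr-install"

# Options come from the list above; the values are examples only.
# BUILD_PDSH=ON is used here, so WITH_PDSH_PREFIX is deliberately left out.
cmake_args="-DCMAKE_INSTALL_PREFIX=$prefix"
cmake_args="$cmake_args -DCMAKE_BUILD_TYPE=Release"
cmake_args="$cmake_args -DBUILD_PDSH=ON"
cmake_args="$cmake_args -DSCR_RESOURCE_MANAGER=SLURM"
cmake_args="$cmake_args -DSCR_ASYNC_API=NONE"

# Print the command for review; run it from the build directory once it looks right.
echo "cmake $cmake_args .."
```

Note that a prefix containing spaces would need extra quoting before the printed command could be run as-is.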

Dependencies

  • C (with support for C++ and Fortran)
  • MPI
  • ECP-VELOC Components (ER, FILO, shuffile, redset, AXL, spath, kvtree, rankstr)
  • CMake, Version 2.8+
  • PDSH
  • DTCMP (optional)
  • libYOGRT (optional)
  • MySQL (optional)

Configuration Files

SCR searches the following locations, in order, for a parameter value and takes the first value it finds:

  1. Environment variables,
  2. User configuration file,
  3. System configuration file,
  4. Compile-time constants.
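
The lookup order above can be mimicked with a short shell sketch. This is not SCR's implementation: it assumes configuration files made of simple NAME=VALUE lines, and the function names `conf_value` and `resolve_param`, as well as the file paths passed to them, are hypothetical.

```shell
# Illustrative only: mimics the documented search order, not SCR's actual code.

# Print the value of NAME from a file of NAME=VALUE lines (assumed format).
# Fails if the file is unreadable or NAME is not set in it.
conf_value() (
    file=$1
    name=$2
    [ -r "$file" ] || return 1
    # If NAME appears more than once, the last line wins (an assumption).
    line=$(grep "^${name}=" "$file" | tail -n 1)
    [ -n "$line" ] || return 1
    printf '%s\n' "${line#*=}"
)

# Resolve NAME: environment, then user file, then system file, then default.
resolve_param() (
    name=$1
    user_conf=$2
    system_conf=$3
    default=$4
    printenv "$name" && return 0                      # 1. environment variable
    conf_value "$user_conf" "$name" && return 0       # 2. user configuration file
    conf_value "$system_conf" "$name" && return 0     # 3. system configuration file
    printf '%s\n' "$default"                          # 4. stand-in for a compile-time constant
)
```

For example, `resolve_param SCR_EXAMPLE_PARAM ./user.conf /etc/scr.conf 0` (with a made-up parameter name and made-up paths) prints the exported value of SCR_EXAMPLE_PARAM if there is one, and only consults the files when the environment has no value.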

To find a user configuration file, SCR looks in the prefix directory for a file named .scrconf (note the leading dot). Alternatively, you can specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time. This repository includes some example configuration files (scr.conf.template, scr.user.conf.template, and examples/test.conf).
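
The rule for locating the user configuration file can be sketched as a small shell function. It assumes that SCR_CONF_FILE, when set, takes the place of the default .scrconf lookup. The function name `user_conf_path` is hypothetical, and the prefix directory is passed in as an argument because how SCR determines it is not covered here.

```shell
# Illustrative only: prints the path where the user configuration file would
# be looked up, given the prefix directory as the first argument.
user_conf_path() (
    prefix=$1
    if [ -n "${SCR_CONF_FILE:-}" ]; then
        # An explicit location was given at run time.
        printf '%s\n' "$SCR_CONF_FILE"
    else
        # Default: a file named .scrconf inside the prefix directory.
        printf '%s\n' "$prefix/.scrconf"
    fi
)
```

In a job script, this corresponds to either placing a .scrconf file in the prefix directory or running `export SCR_CONF_FILE=/path/to/my.conf` (a made-up path) before launching the application.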

Authors

Numerous people have contributed to the SCR project.

To reference SCR in a publication, please cite the following paper:

Additional information and research publications can be found here:

http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

Developers

Developer documentation is provided at SCR-dev.ReadTheDocs.io.
