FTLib (Fault-Tolerant Library) is a framework that keeps data-parallel distributed training running when workers are lost or when new workers join. It exposes collective communication APIs with fault-tolerance support by gluing a consensus implementation to a communication library, both of which can be chosen by the user. Distributed training that uses FTLib can continue as long as at least one worker is alive, and it can take in new workers as they join.
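The idea of gluing a consensus to a communication library can be sketched in a few lines of Python. This is a toy, single-process illustration, not FTLib's actual API: every name in it (`InMemoryConsensus`, `InMemoryCommLib`, `FaultTolerantAverager`, `WorkerLost`) is hypothetical, and the "communication" is an in-memory sum. It only shows the control flow: ask the consensus who is alive, run the collective, and on failure update the membership and retry with the survivors.

```python
class WorkerLost(Exception):
    """Hypothetical error raised by the toy commlib when a peer is unreachable."""

    def __init__(self, worker):
        super().__init__(f"worker {worker} is unreachable")
        self.worker = worker


class InMemoryConsensus:
    """Toy stand-in for a consensus: tracks the agreed set of live workers."""

    def __init__(self, workers):
        self._members = set(workers)

    def members(self):
        return sorted(self._members)

    def join(self, worker):
        self._members.add(worker)

    def remove(self, worker):
        self._members.discard(worker)


class InMemoryCommLib:
    """Toy stand-in for a communication library: element-wise sum of vectors."""

    def __init__(self):
        self.dead = set()  # workers that simulate a crash

    def allreduce_sum(self, grads, members):
        for worker in members:
            if worker in self.dead:
                raise WorkerLost(worker)
        length = len(grads[members[0]])
        return [sum(grads[w][i] for w in members) for i in range(length)]


class FaultTolerantAverager:
    """Glue layer: retries the collective with whoever is still alive."""

    def __init__(self, consensus, commlib):
        self.consensus = consensus
        self.commlib = commlib

    def average(self, grads):
        while True:
            members = self.consensus.members()
            if not members:
                raise RuntimeError("no workers alive")
            try:
                total = self.commlib.allreduce_sum(grads, members)
            except WorkerLost as err:
                # Drop the failed worker from the membership and retry.
                self.consensus.remove(err.worker)
                continue
            return [x / len(members) for x in total]


if __name__ == "__main__":
    consensus = InMemoryConsensus(["w0", "w1", "w2"])
    comm = InMemoryCommLib()
    ft = FaultTolerantAverager(consensus, comm)
    grads = {"w0": [1.0, 2.0], "w1": [3.0, 4.0], "w2": [5.0, 6.0]}

    print(ft.average(grads))   # all three workers contribute
    comm.dead.add("w2")        # simulate losing a worker
    print(ft.average(grads))   # averaged over the two survivors
    consensus.join("w3")       # a new worker joins
    grads["w3"] = [8.0, 9.0]
    print(ft.average(grads))   # averaged over w0, w1 and w3
```

Because the average is always divided by the current member count, the result stays correctly scaled as workers leave or join. A real deployment would replace both toy classes with an actual consensus and communication library.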
Prototyping
TODO: Please refer to the design docs.
- Less reliable infrastructure/script
Distributed training jobs running on less reliable infrastructure are at greater risk, because any worker or communication failure leads to the termination of the entire job.
- Dynamic workload system
A system may reduce the total workload of distributed training jobs to free up resources for jobs with higher priority. When there are no higher-priority jobs, the system can increase the workload again to avoid leaving resources idle.
The requirements for using FTLib differ with the choice of consensus and communication library. Please refer to the requirements.txt under each consensus and communication library (not yet available; still on the TODO list).
Please refer to the test directory for details on how to use FTLib in distributed training.
.
├── CHANGELOG.md
├── deploy
├── docs
│ ├── design
│ └── imgs
├── ftlib
│ ├── consensus
│ ├── commlib
│ ├── ftlib_status.py
│ ├── __init__.py
│ └── rank_assign_scheme.py
├── LICENSE
├── OWNERS
├── README.md
├── requirements.txt
├── ROADMAP
├── scripts
└── test
FTLib is released under the Apache license. Implementations of consensus and communication libraries may come with different licenses.