bioboxes - Standards for Interoperable Bioinformatics Containers
- Version: 0.8.1
- Maintainer: Michael Barton email@example.com
The purpose of this subsequent documents is provide a detailed specification for developers to write standardised bioinformatics containers. The goal of this document is to define a standard whereby bioinformatics software containers of the same type are interoperable and therefore can used interchangeably. The audience of this document are bioinformaticians and developers writing bioinformatics software shared using Linux containers. This document will describe the interface that MUST be provided to a running container and that a developer of the bioinformatics container MUST write their software against.
The scope of this standard is bioinformatics software packaged using Linux containers. Bioinformatics software in a Linux container can be shared and provided to third parties because software dependencies are included within the container. Examples of bioinformatics software are genome assemblers, read binners and read aligners. Examples of container software are Docker, Rocket and LXC/LXD. Standardising bioinformatics software in containers allows interchangeable use between different research groups and institutions.
Applications of this standardisation are:
- A developer uploads his short read aligner as a container to an online repository for others to use. A biologists downloads this aligner and is able to use it immediately as it follows a standardised interface that the biologist is already familiar with.
- A genome assembly benchmarking service downloads many genome assembler containers. These containers are evaluated using assembly performance metrics. The standardised interface allows all containers to be benchmarked the same way.
- A large sequencing centre invests time to develop an improved genome assembly pipeline for single cell data. The pipeline is packaged inside a Linux container and shared with the bioinformatics community. Another large sequencing centre is able to immediately compare this new pipeline with their in-house pipeline using the same container interface.
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in [RFC2119].
PAIRED: Paired reads are defined as the organisation of a FASTA or FASTQ file where the Nth and Nth+1 reads originate from opposite ends of the same DNA fragment, where N % 2 == 0 using 0-based indexing.
Generic bioinformatics container
This specification describes the required inputs for all containerised bioinformatics software, independent of the application type.
- TASK: The argument given to start a container MUST be a single string containing only the characters A-Z, a-z, 0-9, '_' and '-'. This argument is used to differentiate different combination of settings the containerised software can be run as. Every container SHOULD support a 'default' task. This runs the container in a mode that is applicable to the most common situation in which the software is used.
The containerised software MUST return a zero exit code when completing successfully, and return a non-zero exit code when an error occurs.
This section describes the variables containing the paths to various databases.
- CONT_DATABASES_DIR: This variable specifies the absolute path to a directory that contains the following databases:
- NCBI Genomes
- BLAST DBs
- [RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels”, BCP 14, RFC 2119, March 1997.