
DOC: proofing fixes
beardymcjohnface committed Feb 26, 2024
1 parent 924f151 commit cb365b7
6 changes: 3 additions & 3 deletions paper/paper.md
@@ -53,12 +53,12 @@ bibliography: paper.bib

# Summary

-Genomes of organisms are constructed by assembling sequencing from whole genome sequencing (WGS). It is useful to determine sequencing read-coverage of the genome assembly, for instance identifying duplication or deletion events, related contigs for binning metagenomes [@metacoag;@graphbin2], or analysing taxonomic compositions of metagenomes [@condiga]. Although calculating read-coverage is a routine task, it typically involves several complete read and write operations (I/O operations). This is not a problem for small datasets, but can be a significant bottleneck for very large datasets. Koverage reduces I/O burden as much as possible to enable maximum scalability. Koverage includes a kmer-based method that significantly reduces the computational complexity for very large reference genomes. Koverage uses Snakemake [@snakemake], providing out-of-the-box support for HPC and cloud environments. It utilises the Snaketool [@snaketool] command line interface, and is installable with PIP or Conda for maximum ease of use. Source code and documentation are available at [https://github.com/beardymcjohnface/Koverage](https://github.com/beardymcjohnface/Koverage).
+Genomes of organisms are constructed by assembling sequence reads from whole genome sequencing. It is useful to determine sequence read-coverage of genome assemblies, for instance identifying duplication or deletion events, identifying related contigs for binning metagenomes [@metacoag;@graphbin2], or analysing taxonomic compositions of metagenomes [@condiga]. Although calculating read-coverage is a routine task, it typically involves several complete read and write operations (I/O operations). This is not a problem for small datasets, but can be a significant bottleneck for very large datasets. Koverage reduces I/O burden as much as possible to enable maximum scalability. Koverage includes a kmer-based method that significantly reduces the computational complexity for very large reference genomes. Koverage uses Snakemake [@snakemake], providing out-of-the-box support for HPC and cloud environments. It utilises the Snaketool [@snaketool] command line interface, and is installable with PIP or Conda for maximum ease of use. Source code and documentation are available at [https://github.com/beardymcjohnface/Koverage](https://github.com/beardymcjohnface/Koverage).
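The kmer-based method mentioned in the paragraph above can be sketched in general terms: sample kmers from the reference and look up their frequencies in a kmer count table built from the reads, so the reference is never fully aligned against. This is an illustrative sketch only — the function names, the sampling scheme, and the parameters are assumptions for exposition, not Koverage's actual implementation.

```python
from collections import Counter

def kmers(seq, k=21):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def kmer_coverage(contig, reads, k=21, sample_every=10):
    """Estimate read coverage of a contig by querying a sample of its
    kmers against a kmer count table built from the reads."""
    # One pass over the read data builds the count table.
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    # Sampling every Nth contig kmer, rather than all of them, is what
    # keeps the lookup cheap for very large reference genomes.
    sampled = list(kmers(contig, k))[::sample_every]
    hits = [read_counts[km] for km in sampled]
    return sum(hits) / len(hits) if hits else 0.0

# Toy example: overlapping reads tiled across a short contig.
contig = "ACGTACGTTAGCCATGGCATTAGCCGATCGA"
reads = [contig[i:i + 15] for i in range(0, len(contig) - 14, 5)]
print(kmer_coverage(contig, reads, k=7, sample_every=3))
```

The key design point the sketch illustrates is that the reads are streamed once to build counts, after which per-contig estimates are pure in-memory lookups — no repeated I/O over the sequencing data.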


# Statement of need

-With the current state of sequencing technologies, it is trivial to generate terabytes of sequencing data for hundreds or even thousands of samples. Databases such as the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA), containing nearly 100 petabytes combined of sequencing data, are constantly being mined and reanalysed in bioinformatics analyses. Memory and I/O bottlenecks lead to under-utilisation of CPUs, and computational inefficiencies at such scales waste thousands of dollars in compute costs. I/O heavy processes in large parallel batches can result in significantly impaired performance. This is especially true for HPC clusters with shared file storage, or for cloud environments using cost-efficient bucket storage.
+With the current state of sequencing technologies, it is trivial to generate terabytes of sequencing data for hundreds or even thousands of samples. Databases such as the Sequence Read Archive and the European Nucleotide Archive, containing nearly 100 petabytes combined of sequencing data, are constantly being mined and reanalysed in bioinformatics analyses. Memory and I/O bottlenecks lead to under-utilisation of CPUs, and computational inefficiencies at such scales waste thousands of dollars in compute costs. I/O heavy processes in large parallel batches can result in significantly impaired performance. This is especially true for HPC clusters with shared file storage, or for cloud environments using cost-efficient bucket storage.

While there are existing tools for performing coverage calculations, they are not optimised for deployment at large scales, or when analysing large reference files. They require several complete I/O operations of the sequencing data in order to generate coverage statistics. Mapping to very large genomes requires large amounts of memory, or alternatively, aligning reads in chunks creating more I/O operations. Moving I/O operations into memory, for example via `tempfs` may alleviate I/O bottlenecks. However, this is highly system-dependent and will exacerbate memory bottlenecks.

@@ -102,7 +102,7 @@ Mapping to very large reference genomes can place considerable strain on compute

# CoverM wrapper

-Koverage includes a wrapper for the popular CoverM [@coverm] tool. CoverM can parse aligned and sorted reads in BAM format. It can also align reads with minimap2, saving the sorted alignments in a temporary filesystem (tempfs), and then process the aligned and sorted reads from tempfs. When a large enough tempfs is available, this method of running CoverM is extremely fast. However, if the tempfs is insufficient for storing the alignments, they are instead written to and read from regular disk storage which can be a significant I/O bottleneck. This wrapper in Koverage will generate alignments with Minimap2, sort and save them in BAM format with SAMtools [@samtools], and run CoverM on the resulting BAM file. CoverM is currently not available for macOS and as such, this wrapper will only run on Linux systems.
+Koverage includes a wrapper for the popular CoverM [@coverm] tool. CoverM can parse aligned and sorted reads in BAM format. It can also align reads with Minimap2, saving the sorted alignments in a temporary filesystem (tempfs), and then process the aligned and sorted reads from tempfs. When a large enough tempfs is available, this method of running CoverM is extremely fast. However, if the tempfs is insufficient for storing the alignments, they are instead written to and read from regular disk storage which can be a significant I/O bottleneck. This wrapper in Koverage will generate alignments with Minimap2, sort and save them in BAM format with SAMtools [@samtools], and run CoverM on the resulting BAM file. CoverM is currently not available for macOS and as such, this wrapper will only run on Linux systems.
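The wrapper's flow described above — align with Minimap2, sort into BAM with SAMtools, then hand the BAM to CoverM — can be sketched as shell commands. The file names, the Minimap2 preset, and the CoverM method below are illustrative assumptions, not Koverage's actual invocation; the script only prints the commands (a dry run) so the pipeline shape is visible without the tools installed.

```shell
REF=assembly.fasta        # hypothetical reference assembly
READS=sample_R1.fastq.gz  # hypothetical read file
BAM=sample.bam

# Align with Minimap2 and pipe straight into samtools sort, so no
# intermediate SAM file is written to disk; then index the BAM.
align_cmd="minimap2 -ax sr $REF $READS | samtools sort -o $BAM -"
index_cmd="samtools index $BAM"

# Hand the sorted, indexed BAM to CoverM for per-contig coverage.
coverm_cmd="coverm contig --bam-files $BAM --methods mean"

printf '%s\n' "$align_cmd" "$index_cmd" "$coverm_cmd"
```

Piping the aligner directly into the sorter is the design choice that matters here: it removes one complete write-then-read cycle of the alignment data, which is exactly the kind of I/O burden the paper describes.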

# Benchmarks

