
DOC: proofing fixes
beardymcjohnface committed Feb 26, 2024
1 parent 924f151 commit cb365b7
6 changes: 3 additions & 3 deletions paper/paper.md
@@ -53,12 +53,12 @@ bibliography: paper.bib

# Summary

-Genomes of organisms are constructed by assembling sequencing from whole genome sequencing (WGS). It is useful to determine sequencing read-coverage of the genome assembly, for instance identifying duplication or deletion events, related contigs for binning metagenomes [@metacoag;@graphbin2], or analysing taxonomic compositions of metagenomes [@condiga]. Although calculating read-coverage is a routine task, it typically involves several complete read and write operations (I/O operations). This is not a problem for small datasets, but can be a significant bottleneck for very large datasets. Koverage reduces I/O burden as much as possible to enable maximum scalability. Koverage includes a kmer-based method that significantly reduces the computational complexity for very large reference genomes. Koverage uses Snakemake [@snakemake], providing out-of-the-box support for HPC and cloud environments. It utilises the Snaketool [@snaketool] command line interface, and is installable with PIP or Conda for maximum ease of use. Source code and documentation are available at [https://github.com/beardymcjohnface/Koverage](https://github.com/beardymcjohnface/Koverage).
+Genomes of organisms are constructed by assembling sequence reads from whole genome sequencing. It is useful to determine sequence read-coverage of genome assemblies, for instance identifying duplication or deletion events, identifying related contigs for binning metagenomes [@metacoag;@graphbin2], or analysing taxonomic compositions of metagenomes [@condiga]. Although calculating read-coverage is a routine task, it typically involves several complete read and write operations (I/O operations). This is not a problem for small datasets, but can be a significant bottleneck for very large datasets. Koverage reduces I/O burden as much as possible to enable maximum scalability. Koverage includes a kmer-based method that significantly reduces the computational complexity for very large reference genomes. Koverage uses Snakemake [@snakemake], providing out-of-the-box support for HPC and cloud environments. It utilises the Snaketool [@snaketool] command line interface, and is installable with PIP or Conda for maximum ease of use. Source code and documentation are available at [https://github.com/beardymcjohnface/Koverage](https://github.com/beardymcjohnface/Koverage).
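The kmer-based method mentioned in the paragraph above can be sketched in general terms: sample kmers from the reference and look up their frequencies in a kmer count table built from the reads, so the reference is never fully aligned against. This is an illustrative sketch only — the function names, the sampling scheme, and the parameters are assumptions for exposition, not Koverage's actual implementation.

```python
from collections import Counter

def kmers(seq, k=21):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def kmer_coverage(contig, reads, k=21, sample_every=10):
    """Estimate read coverage of a contig by querying a sample of its
    kmers against a kmer count table built from the reads."""
    # One pass over the read data builds the count table.
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    # Sampling every Nth contig kmer, rather than all of them, is what
    # keeps the lookup cheap for very large reference genomes.
    sampled = list(kmers(contig, k))[::sample_every]
    hits = [read_counts[km] for km in sampled]
    return sum(hits) / len(hits) if hits else 0.0

# Toy example: overlapping reads tiled across a short contig.
contig = "ACGTACGTTAGCCATGGCATTAGCCGATCGA"
reads = [contig[i:i + 15] for i in range(0, len(contig) - 14, 5)]
print(kmer_coverage(contig, reads, k=7, sample_every=3))
```

The key design point the sketch illustrates is that the reads are streamed once to build counts, after which per-contig estimates are pure in-memory lookups — no repeated I/O over the sequencing data.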


# Statement of need

-With the current state of sequencing technologies, it is trivial to generate terabytes of sequencing data for hundreds or even thousands of samples. Databases such as the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA), containing nearly 100 petabytes combined of sequencing data, are constantly being mined and reanalysed in bioinformatics analyses. Memory and I/O bottlenecks lead to under-utilisation of CPUs, and computational inefficiencies at such scales waste thousands of dollars in compute costs. I/O heavy processes in large parallel batches can result in significantly impaired performance. This is especially true for HPC clusters with shared file storage, or for cloud environments using cost-efficient bucket storage.
+With the current state of sequencing technologies, it is trivial to generate terabytes of sequencing data for hundreds or even thousands of samples. Databases such as the Sequence Read Archive and the European Nucleotide Archive, containing nearly 100 petabytes combined of sequencing data, are constantly being mined and reanalysed in bioinformatics analyses. Memory and I/O bottlenecks lead to under-utilisation of CPUs, and computational inefficiencies at such scales waste thousands of dollars in compute costs. I/O heavy processes in large parallel batches can result in significantly impaired performance. This is especially true for HPC clusters with shared file storage, or for cloud environments using cost-efficient bucket storage.

While there are existing tools for performing coverage calculations, they are not optimised for deployment at large scales, or when analysing large reference files. They require several complete I/O operations of the sequencing data in order to generate coverage statistics. Mapping to very large genomes requires large amounts of memory, or alternatively, aligning reads in chunks creating more I/O operations. Moving I/O operations into memory, for example via `tempfs` may alleviate I/O bottlenecks. However, this is highly system-dependent and will exacerbate memory bottlenecks.

@@ -102,7 +102,7 @@ Mapping to very large reference genomes can place considerable strain on compute

# CoverM wrapper

-Koverage includes a wrapper for the popular CoverM [@coverm] tool. CoverM can parse aligned and sorted reads in BAM format. It can also align reads with minimap2, saving the sorted alignments in a temporary filesystem (tempfs), and then process the aligned and sorted reads from tempfs. When a large enough tempfs is available, this method of running CoverM is extremely fast. However, if the tempfs is insufficient for storing the alignments, they are instead written to and read from regular disk storage which can be a significant I/O bottleneck. This wrapper in Koverage will generate alignments with Minimap2, sort and save them in BAM format with SAMtools [@samtools], and run CoverM on the resulting BAM file. CoverM is currently not available for macOS and as such, this wrapper will only run on Linux systems.
+Koverage includes a wrapper for the popular CoverM [@coverm] tool. CoverM can parse aligned and sorted reads in BAM format. It can also align reads with Minimap2, saving the sorted alignments in a temporary filesystem (tempfs), and then process the aligned and sorted reads from tempfs. When a large enough tempfs is available, this method of running CoverM is extremely fast. However, if the tempfs is insufficient for storing the alignments, they are instead written to and read from regular disk storage which can be a significant I/O bottleneck. This wrapper in Koverage will generate alignments with Minimap2, sort and save them in BAM format with SAMtools [@samtools], and run CoverM on the resulting BAM file. CoverM is currently not available for macOS and as such, this wrapper will only run on Linux systems.
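The wrapper's flow described above — align with Minimap2, sort into BAM with SAMtools, then hand the BAM to CoverM — can be sketched as shell commands. The file names, the Minimap2 preset, and the CoverM method below are illustrative assumptions, not Koverage's actual invocation; the script only prints the commands (a dry run) so the pipeline shape is visible without the tools installed.

```shell
REF=assembly.fasta        # hypothetical reference assembly
READS=sample_R1.fastq.gz  # hypothetical read file
BAM=sample.bam

# Align with Minimap2 and pipe straight into samtools sort, so no
# intermediate SAM file is written to disk; then index the BAM.
align_cmd="minimap2 -ax sr $REF $READS | samtools sort -o $BAM -"
index_cmd="samtools index $BAM"

# Hand the sorted, indexed BAM to CoverM for per-contig coverage.
coverm_cmd="coverm contig --bam-files $BAM --methods mean"

printf '%s\n' "$align_cmd" "$index_cmd" "$coverm_cmd"
```

Piping the aligner directly into the sorter is the design choice that matters here: it removes one complete write-then-read cycle of the alignment data, which is exactly the kind of I/O burden the paper describes.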

# Benchmarks

