Merge pull request #44 from csoneson/JOSS
Minor modifications to paper and bibliography
beardymcjohnface committed Feb 25, 2024
2 parents 7a0456a + 67ce277 commit c42a0f3
Showing 2 changed files with 6 additions and 6 deletions.
4 changes: 2 additions & 2 deletions paper/paper.bib
@@ -9,7 +9,7 @@ @Article{snakemake
}
@article{snaketool,
doi = {10.1371/journal.pcbi.1010705},
-author = {Roach, Michael J AND Pierce-Ward, N Tessa AND Suchecki, Radoslaw AND Mallawaarachchi, Vijini AND Papudeshi, Bhavya AND Handley, Scott A AND Brown, C Titus AND Watson-Haigh, Nathan S AND Edwards, Robert A},
+author = {Roach, Michael J and Pierce-Ward, N Tessa and Suchecki, Radoslaw and Mallawaarachchi, Vijini and Papudeshi, Bhavya and Handley, Scott A and Brown, C Titus and Watson-Haigh, Nathan S and Edwards, Robert A},
journal = {PLOS Computational Biology},
publisher = {Public Library of Science},
title = {Ten simple rules and a template for creating workflows-as-applications},
@@ -147,7 +147,7 @@ @misc{metasnek
howpublished = {\url{https://github.com/beardymcjohnface/metasnek}},
}
@article{coral,
-author = {Lima, Laís FO and Alker, Amanda T and Papudeshi, Bhavya and Morris, Megan M and Edwards, Robert A and de Putron, Samantha J and Dinsdale, Elizabeth A},
+author = {Lima, Laís FO and Alker, Amanda T and Papudeshi, Bhavya and Morris, Megan M and Edwards, Robert A and {de Putron}, Samantha J and Dinsdale, Elizabeth A},
title = "{Coral and Seawater Metagenomes Reveal Key Microbial Functions to Coral Health and Ecosystem Functioning Shaped at Reef Scale}",
year = {2023},
doi = {10.1007/s00248-022-02094-6},
8 changes: 4 additions & 4 deletions paper/paper.md
@@ -53,7 +53,7 @@ bibliography: paper.bib

# Summary

-Genomes of organisms are constructed by assembling short fragments (called sequencing reads) that are the resulting data outputs of whole genome sequencing (WGS). It is useful to determine the read-coverage of sequencing reads in the resulting genome assembly for many reasons, such as identifying duplication or deletion events, identifying related contigs for binning in metagenome assemblies [@metacoag;@graphbin2], or analysing taxonomic compositions of metagenomic samples [@condiga]. Although calculating the read-coverage of sequencing reads to a reference genome is a routine task, it typically involves several read and write operations (I/O operations) of the sequencing data. Although this is not a problem for small datasets, it can be a significant bottleneck when analysing a large number of samples, or when screening very large reference sequence files. Koverage is designed to reduce the I/O burden as much as possible to enable maximum scalability for large sample sizes. Koverage also includes a kmer-based coverage method that significantly reduces the computational complexity of screening large reference genomes such as the human genome. Koverage is a Snakemake [@snakemake] based pipeline, providing out-of-the-box support for HPC and cloud environments. It utilises the Snaketool [@snaketool] command line interface and is available to install via PIP or Conda for maximum ease-of-use. The source code and documentation is available at [https://github.com/beardymcjohnface/Koverage](https://github.com/beardymcjohnface/Koverage).
+Genomes of organisms are constructed by assembling short fragments (called sequencing reads) that are the resulting data outputs of whole genome sequencing (WGS). It is useful to determine the read-coverage of sequencing reads in the resulting genome assembly for many reasons, such as identifying duplication or deletion events, identifying related contigs for binning in metagenome assemblies [@metacoag;@graphbin2], or analysing taxonomic compositions of metagenomic samples [@condiga]. Although calculating the read-coverage of sequencing reads to a reference genome is a routine task, it typically involves several read and write operations (I/O operations) of the sequencing data. Although this is not a problem for small datasets, it can be a significant bottleneck when analysing a large number of samples, or when screening very large reference sequence files. Koverage is designed to reduce the I/O burden as much as possible to enable maximum scalability for large sample sizes. Koverage also includes a kmer-based coverage method that significantly reduces the computational complexity of screening large reference genomes such as the human genome. Koverage is a Snakemake [@snakemake] based pipeline, providing out-of-the-box support for HPC and cloud environments. It utilises the Snaketool [@snaketool] command line interface and is available to install via PIP or Conda for maximum ease of use. The source code and documentation is available at [https://github.com/beardymcjohnface/Koverage](https://github.com/beardymcjohnface/Koverage).


# Statement of need
@@ -73,7 +73,7 @@ Koverage will parse reads (`--reads`) using MetaSnek `fastq_finder` [@metasnek].

# Mapping-based coverage

-This is the default method for calculating coverage statistics. Reads are mapped sample-by-sample to the reference genome using Minimap2 [@minimap]. The minimap2 alignments are parsed in real-time by a script that collects the counts per contig and total counts per sample. Koverage also uses the read mapping coordinates to collect read counts for `_bins_` or `_windows_` along the contig. This allows for a fast approximation of the coverage of each contig by at least one read (hitrate), and of the evenness of coverage (variance) for each contig. Following mapping, the final counts, mean, median, hitrate, and variance are written to a TSV file. A second script calculates the Reads Per Million (RPM), Reads Per Kilobase Million (RPKM), Reads Per Kilobase (RPK), and Transcripts Per Million (TPM) like so:
+This is the default method for calculating coverage statistics. Reads are mapped sample-by-sample to the reference genome using minimap2 [@minimap]. The minimap2 alignments are parsed in real time by a script that collects the counts per contig and total counts per sample. Koverage also uses the read mapping coordinates to collect read counts for _bins_ or _windows_ along the contig. This allows for a fast approximation of the coverage of each contig by at least one read (hitrate), and of the evenness of coverage (variance) for each contig. Following mapping, the final counts, mean, median, hitrate, and variance are written to a TSV file. A second script calculates the Reads Per Million (RPM), Reads Per Kilobase Million (RPKM), Reads Per Kilobase (RPK), and Transcripts Per Million (TPM) like so:

__RPM__ = $\frac{10^6 \times N}{T}$
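
The remaining formulas are cut off by this hunk; they presumably follow the standard definitions, where RPKM divides RPM by the contig length in kilobases, RPK is the raw count per kilobase, and TPM rescales RPK so that each sample sums to 10^6. A minimal sketch of those calculations in Python, assuming N is the read count for a contig and T is the sample's total mapped reads (function and variable names are illustrative, not taken from the Koverage source):

```python
def coverage_normalisations(counts, lengths, total_reads):
    """counts: mapped reads per contig (N); lengths: contig lengths in bp;
    total_reads: total mapped reads in the sample (T)."""
    rpk = {c: counts[c] / (lengths[c] / 1_000) for c in counts}  # reads per kilobase
    rpk_scale = sum(rpk.values()) / 1_000_000                    # per-sample TPM scaling factor
    results = {}
    for c in counts:
        rpm = 1_000_000 * counts[c] / total_reads                # RPM = 10^6 * N / T
        results[c] = {
            "rpm": rpm,
            "rpkm": rpm / (lengths[c] / 1_000),                  # RPM per kilobase of contig
            "rpk": rpk[c],
            "tpm": rpk[c] / rpk_scale if rpk_scale else 0.0,     # TPMs sum to 10^6 per sample
        }
    return results

# e.g. coverage_normalisations({"contig_1": 500}, {"contig_1": 10_000}, total_reads=1_000_000)
```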

@@ -94,15 +94,15 @@ As mentioned, Koverage uses a fast estimation for mean, median, hitrate, and var

![Windowed-coverage counts. Counts of start coordinates of mapped reads are collected for each bin across a contig. The counts array is used to calculate estimates for coverage hitrate and variance.\label{fig:counts}](fig1.png){ width=100% }
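
A minimal sketch of the windowed counting idea from the figure above, assuming only that read start coordinates are tallied into fixed-size bins along a contig (the bin count and function name are illustrative, not the Koverage implementation):

```python
import statistics

def windowed_stats(read_starts, contig_len, n_bins=50):
    """Tally read start coordinates into bins along a contig and estimate
    hitrate (fraction of bins with at least one read) and count variance."""
    bin_size = max(1, contig_len // n_bins)
    counts = [0] * n_bins
    for pos in read_starts:
        counts[min(pos // bin_size, n_bins - 1)] += 1
    hitrate = sum(1 for c in counts if c > 0) / n_bins
    variance = statistics.pvariance(counts)
    return counts, hitrate, variance

# e.g. windowed_stats([12, 480, 955, 960], contig_len=1000, n_bins=10)
```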

-Lastly, the coverage from all samples are collated, and a summary of the coverage for each contig by all samples is calculated. A summary HTML report is then generated which includes interactive graphs and tables for both the per sample coverge, and the combined coverage from all samples. In the HTML report, we utilized Datapane [@datapane] to embed both a combined bar and line chart from Plotly [@plotly] and an interactive table displaying the results. This visualization represents the reads that have been mapped to each contig within the given reference sequence. The visualization is organized into two distinct tabs: one showcasing the individual read files with their associated mapping, and the other illustrating the combined read files with their respective mapping.
+Lastly, the coverage from all samples are collated, and a summary of the coverage for each contig by all samples is calculated. A summary HTML report is then generated which includes interactive graphs and tables for both the per sample coverage, and the combined coverage from all samples. In the HTML report, we utilized Datapane [@datapane] to embed both a combined bar and line chart from Plotly [@plotly] and an interactive table displaying the results. This visualization represents the reads that have been mapped to each contig within the given reference sequence. The visualization is organized into two distinct tabs: one showcasing the individual read files with their associated mapping, and the other illustrating the combined read files with their respective mapping.
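
As a rough illustration of the combined bar and line chart described above, the following Plotly snippet builds a two-axis figure and writes it to standalone HTML; the column names and example values are hypothetical, and the actual Koverage report assembles its figures through Datapane:

```python
import plotly.graph_objects as go

# Hypothetical per-contig summary: mapped read counts (bars) and mean depth (line).
contigs = ["contig_1", "contig_2", "contig_3"]
read_counts = [15000, 4200, 900]
mean_depth = [35.2, 9.8, 2.1]

fig = go.Figure()
fig.add_bar(x=contigs, y=read_counts, name="Mapped reads")
fig.add_scatter(x=contigs, y=mean_depth, name="Mean depth",
                mode="lines+markers", yaxis="y2")
fig.update_layout(
    yaxis=dict(title="Mapped reads"),
    yaxis2=dict(title="Mean depth", overlaying="y", side="right"),
)
fig.write_html("coverage_report_example.html")  # standalone interactive HTML
```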

# Kmer-based coverage

Mapping to very large reference genomes can place considerable strain on computer resources. As an alternative, Koverage offers a kmer-based approach to estimating coverage across contigs. First, the reference genome is processed and kmers are sampled evenly across each contig. The user can customise the kmer size, sampling interval, and minimum and maximum number of kmers to sample for each contig. Jellyfish [@jellyfish] databases are then created for each sample. Koverage will initiate an interactive Jellyfish session for each sample's kmer database. The kmers that were sampled from each reference contig are queried against the sample kmer database and the kmer counts, and a kmer count array is created for each contig. The sum, mean, and median are calculated directly from the count array, and the hitrate is calculated as the number of kmer counts > 0 divided by the total number of kmers queried. As variance is highly sensitive to large outliers, and kmer counts are especially prone to large outliers for repetitive sequences, the variance is calculated as the standard variance of the lowest 95 % of kmer counts.
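
A minimal sketch of the per-contig summary statistics described above, assuming the sampled kmer counts for one contig have already been retrieved from the Jellyfish query as a list of integers (the 95 % trimming follows the text; everything else is illustrative):

```python
import statistics

def kmer_coverage_stats(kmer_counts):
    """Summarise one contig's sampled kmer counts."""
    total = sum(kmer_counts)
    hitrate = sum(1 for c in kmer_counts if c > 0) / len(kmer_counts)
    # Variance over the lowest 95% of counts, to dampen repeat-driven outliers.
    trimmed = sorted(kmer_counts)[: max(1, int(len(kmer_counts) * 0.95))]
    return {
        "sum": total,
        "mean": total / len(kmer_counts),
        "median": statistics.median(kmer_counts),
        "hitrate": hitrate,
        "variance": statistics.pvariance(trimmed),
    }

# e.g. kmer_coverage_stats([10, 12, 0, 11, 9, 250])  # 250 is trimmed before the variance
```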

# CoverM wrapper

-Koverage includes a wrapper for the popular CoverM [@coverm] tool. CoverM can parse aligned and sorted reads in BAM format. It can also align reads with minimap2, saving the sorted alignments in a temporary filesystem (tempfs), and then process the aligned and sorted reads from tempfs. When a large enough tempfs is available, this method of running CoverM is extremely fast. However, if the tempfs is insufficient for storing the alignments, they are instead written to and read from regular disk storage which can be a significant I/O bottleneck. This wrapper in Koverage will use Minimap2 to generate alignments, sort them and save them in BAM format with SamTools [@samtools], and then run CoverM on the resulting BAM file. While this is not the fastest method for running CoverM, it is convenient for users wishing to retain the sorted alignments in BAM format, and for automated running over many samples with a combined output summary file. CoverM is currently not available for MacOS and as such, this wrapper will only run on Linux systems.
+Koverage includes a wrapper for the popular CoverM [@coverm] tool. CoverM can parse aligned and sorted reads in BAM format. It can also align reads with minimap2, saving the sorted alignments in a temporary filesystem (tempfs), and then process the aligned and sorted reads from tempfs. When a large enough tempfs is available, this method of running CoverM is extremely fast. However, if the tempfs is insufficient for storing the alignments, they are instead written to and read from regular disk storage which can be a significant I/O bottleneck. This wrapper in Koverage will use minimap2 to generate alignments, sort them and save them in BAM format with SAMtools [@samtools], and then run CoverM on the resulting BAM file. While this is not the fastest method for running CoverM, it is convenient for users wishing to retain the sorted alignments in BAM format, and for automated running over many samples with a combined output summary file. CoverM is currently not available for macOS and as such, this wrapper will only run on Linux systems.
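
The wrapped pipeline amounts to aligning with minimap2, sorting to BAM with SAMtools, and handing the BAM to CoverM. A hedged sketch of the equivalent steps driven from Python follows; the minimap2 preset, the reliance on CoverM's default stdout output, and the file names are assumptions, and the exact flags Koverage passes may differ:

```python
import subprocess

ref, reads, bam = "ref.fasta", "sample_R1.fastq.gz", "sample.bam"  # illustrative paths

# minimap2 alignment piped into samtools sort; '-ax sr' is a common short-read
# preset and may not be the exact preset Koverage uses.
minimap = subprocess.Popen(["minimap2", "-ax", "sr", ref, reads], stdout=subprocess.PIPE)
subprocess.run(["samtools", "sort", "-o", bam], stdin=minimap.stdout, check=True)
minimap.stdout.close()
minimap.wait()
subprocess.run(["samtools", "index", bam], check=True)

# CoverM per-contig coverage from the sorted BAM; by default the TSV goes to stdout.
coverm = subprocess.run(["coverm", "contig", "--bam-files", bam],
                        check=True, capture_output=True, text=True)
print(coverm.stdout)
```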

# Benchmarks
