There are two ways of viewing this:
The first idea is to reproduce KmerFinder [2] in Rust (it is available in Python, and also in JavaScript).
- K-mer count on FASTQ.
- Filter by prefix.
- Make it work for FASTA and BED files.
- Compare k-mer distribution of two inputs.
- Move towards a KMA implementation.
For this example, the first two FASTQ files of SRR396636 were downloaded. They correspond to reads from Pseudomonas aeruginosa MPAO1/P1, with 1909263 sequences of ~100 bp each.
With ripkmer installed and in the $PATH:
ripkmer SRR396636.sra_1.fastq SRR396636.sra_2.fastq
Here k and the prefix are left at their defaults, which is equivalent to:
ripkmer SRR396636.sra_1.fastq SRR396636.sra_2.fastq 16 ATGAC
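To make the roles of k and the prefix concrete, here is a minimal sketch of prefix-filtered k-mer counting. It is not ripkmer's actual code: the names `count_kmers` and `count_fastq` are hypothetical, and it assumes plain four-line FASTQ records, forward-strand k-mers only, and no special handling of ambiguous bases.

```rust
use std::collections::HashMap;

/// Count every k-mer of `seq` that starts with `prefix` (hypothetical helper).
fn count_kmers(seq: &[u8], k: usize, prefix: &[u8], counts: &mut HashMap<Vec<u8>, u32>) {
    // `windows` panics on a zero size, so reject k == 0 explicitly.
    if k == 0 {
        return;
    }
    // Reads shorter than k simply yield no windows.
    for kmer in seq.windows(k) {
        if kmer.starts_with(prefix) {
            *counts.entry(kmer.to_vec()).or_insert(0) += 1;
        }
    }
}

/// Count prefix-filtered k-mers over FASTQ text.
/// Assumption: each record is exactly four lines, the sequence being the second.
fn count_fastq(text: &str, k: usize, prefix: &[u8]) -> HashMap<Vec<u8>, u32> {
    let mut counts = HashMap::new();
    for seq in text.lines().skip(1).step_by(4) {
        count_kmers(seq.trim_end().as_bytes(), k, prefix, &mut counts);
    }
    counts
}
```

With the defaults shown above, the call would look like `count_fastq(&text, 16, b"ATGAC")`, where `text` holds the contents of one FASTQ file. Filtering by prefix keeps only a fraction of all k-mers, which is what keeps the table small and the run fast.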
The output is printed in tabular format to standard output, so it can be redirected to a file, and the run should not take much more than 4 s.
(16-mers)              Unique  Redundant  Intersection_unique  Intersection
SRR396636.sra_1.fastq  23196   97871      34.19%               58.81%
SRR396636.sra_2.fastq  30698   89107      25.83%               64.59%
where
- Unique is the number of unique k-mers found in the file;
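- Redundant is the total number of k-mers found (with repetitions);
- Intersection_unique is the number of unique k-mers common to both files, shown as a percentage of that file's unique k-mers;
- and Intersection is the total number of k-mers common to both files, shown as a percentage of that file's redundant k-mers.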
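The four columns can be derived from two k-mer count maps, as in the sketch below. It is an illustration, not ripkmer's implementation: `Summary` and `summarize` are hypothetical names. It also assumes that Intersection counts this file's occurrences of the shared k-mers, which is one reading consistent with the sample output but is not confirmed by the source.

```rust
use std::collections::HashMap;

/// One row of the table for a single file (hypothetical struct).
#[derive(Debug, PartialEq)]
struct Summary {
    unique: usize,                // distinct k-mers in this file
    redundant: u64,               // all k-mers in this file, with repetitions
    intersection_unique_pct: f64, // shared distinct k-mers, as % of `unique`
    intersection_pct: f64,        // occurrences of shared k-mers, as % of `redundant`
}

/// Summarize `own` counts relative to the `other` file's counts.
fn summarize(own: &HashMap<Vec<u8>, u32>, other: &HashMap<Vec<u8>, u32>) -> Summary {
    let unique = own.len();
    let redundant: u64 = own.values().map(|&c| u64::from(c)).sum();

    let mut shared_unique = 0usize;
    let mut shared_total = 0u64;
    for (kmer, &count) in own {
        if other.contains_key(kmer) {
            shared_unique += 1;
            // Assumption: count this file's occurrences of the shared k-mer.
            shared_total += u64::from(count);
        }
    }

    // Avoid dividing by zero when a file yields no k-mers.
    let pct = |part: f64, whole: f64| if whole > 0.0 { 100.0 * part / whole } else { 0.0 };

    Summary {
        unique,
        redundant,
        intersection_unique_pct: pct(shared_unique as f64, unique as f64),
        intersection_pct: pct(shared_total as f64, redundant as f64),
    }
}
```

Calling `summarize(&a, &b)` and `summarize(&b, &a)` gives one row per file. The percentages differ between rows because each is taken relative to that file's own totals.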
[1] Köster, J. (2016). Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), 444-446.
[2] Larsen, M. V., Cosentino, S., Lukjancenko, O., Saputra, D., Rasmussen, S., Hasman, H., Sicheritz-Pontén, T., Aarestrup, F. M., Ussery, D. W., & Lund, O. (2014, Feb 26). Benchmarking of methods for genomic taxonomy. J Clin Microbiol.