Introduction: what is sylph and how does it work?

Introduction: what is sylph?

Sylph is an extremely fast, memory-efficient program for profiling and searching metagenomic samples against databases. It is 10-100x faster than other popular software such as MetaPhlAn or Kraken, and it uses less memory too.

What can sylph do?

Profile metagenomes: sylph can calculate the abundances of genomes in a sample using a reference database. This is the same type of output as Kraken or MetaPhlAn.
Search genomes against metagenomes: sylph can check if a genome is contained in your sample (e.g. is this E. coli genome in my sample?).
ANI querying: sylph can estimate the containment average nucleotide identity (ANI) of a reference genome to the genomes in your sample.
Use custom reference databases: eukaryotes, viruses, and any collection of FASTA files are supported.
Long reads are usable: sylph is primarily optimized for short reads, but it can use nanopore or PacBio reads with high precision.
Calculate coverage: sylph can estimate the coverage (not just the abundance) of genomes in your database.

What can sylph NOT do?

Sylph cannot:

Map reads. Unlike Kraken, sylph does not classify every read.
Find very low-abundance genomes. Sylph requires at least 0.01-0.05x coverage for bacterial genomes, which means every bacterial genome needs at least a few hundred short reads.
Reliably find genomes at genus level or higher (if it is not present at species level). If your sample is not well-characterized by the database, sylph may struggle. Note: this also applies to most profilers.
Compare genomes to genomes, or compare metagenomes to metagenomes.
Work with 16S data.
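
To make the low-abundance limit concrete, here is a rough back-of-the-envelope sketch of how coverage translates into read counts. The genome size (5 Mbp) and read length (150 bp) are illustrative assumptions, not values from sylph, and the function name is hypothetical.

```python
import math

def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Approximate number of reads needed to reach a given coverage.

    Uses the standard relation: coverage = reads * read_length / genome_size.
    """
    return math.ceil(coverage * genome_size_bp / read_length_bp)

# Assumed example values: a 5 Mbp bacterial genome and 150 bp short reads.
low = reads_needed(0.01, 5_000_000, 150)
high = reads_needed(0.05, 5_000_000, 150)
print(f"0.01x coverage: about {low} reads")
print(f"0.05x coverage: about {high} reads")
```

Under these assumptions, the 0.01x end of the range corresponds to a few hundred reads, which matches the guideline above.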

How does sylph work?

The figure below summarizes sylph's main steps.

(Panel 1) Reads and reference genomes are broken into k-mers using the sylph sketch command. k-mers are downsampled by a factor of c (default c = 200), so roughly 1 in every c k-mers is kept.
(Panel 1) Using sylph query or sylph profile, the k-mers in each reference genome are checked against the k-mers in the reads.
(Panel 2) Sylph uses statistics to estimate the containment ANI between each reference genome and the metagenomes.
(Panel 3) sylph query: all genomes with high ANI (> 90% default) from the previous step are reported. No abundances.
(Panel 3) sylph profile: calculates abundances and reports the present genomes at species-level using a k-mer remapping algorithm if ANI > 95%.
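
The first steps above can be illustrated with a minimal Python sketch. This is a simplified illustration, not sylph's actual implementation: it assumes a hash-based subsampling rule (keep a k-mer when its hash is divisible by c), ignores reverse complements, and converts containment to ANI with the naive formula containment^(1/k). Sylph's real estimator applies additional statistics, and all function names here are hypothetical.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (assumed hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k=31, c=200):
    """Keep roughly 1 in c k-mers: those whose hash is divisible by c."""
    return {h for h in map(kmer_hash, kmers(seq, k)) if h % c == 0}

def containment_ani(genome_sketch, reads_sketch, k=31):
    """Naive containment ANI: fraction of genome k-mers found in the reads,
    raised to the power 1/k. Returns None for an empty genome sketch."""
    if not genome_sketch:
        return None
    containment = len(genome_sketch & reads_sketch) / len(genome_sketch)
    return containment ** (1 / k)

# Toy usage with small k and c = 1 (no downsampling) so the sketch is non-empty.
genome = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
reads = genome  # pretend the reads cover the genome perfectly
ani = containment_ani(sketch(genome, k=5, c=1), sketch(reads, k=5, c=1), k=5)
print(ani)
```

With a naive estimate like this, low coverage alone would make a genome look less similar than it is, because many of its k-mers are simply never sampled by the reads. This is why the statistical step in Panel 2 matters.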

How do I use sylph?

See the tutorials and manuals outlined in the README. The cookbook may especially be of use.
