Introduction: what is sylph and how does it work?

Jim Shaw edited this page Apr 25, 2024 · 7 revisions

Introduction: what is sylph?

Sylph is an extremely fast and memory-efficient program for profiling and searching metagenomic samples against databases. It is 10-100x faster than other popular tools such as MetaPhlAn or Kraken, and it uses less memory.

What can sylph do?

  • Profile metagenomes: sylph can calculate the abundances of genomes in a sample using a reference database. This is the same type of output as Kraken or MetaPhlAn.
  • Search genomes against metagenomes: sylph can check if a genome is contained in your sample (e.g. is this E. coli genome in my sample?).
  • ANI querying: sylph can estimate the containment average nucleotide identity (ANI) of a reference genome to the genomes in your sample.
  • Use custom reference databases: eukaryotes, viruses, and any collection of FASTA files are supported.
  • Use long reads: sylph is primarily optimized for short reads, but it can also use nanopore or PacBio reads with high precision.
  • Calculate coverage: sylph can estimate the coverage (not just the abundance) of genomes in your database.
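
To make the distinction between coverage and abundance concrete, here is a minimal Python sketch. It assumes that relative abundance is obtained by normalizing per-genome coverage so that the values sum to 100%. That is a common convention, not necessarily sylph's exact calculation, and the function name and genome names are hypothetical.

```python
def relative_abundances(coverages: dict[str, float]) -> dict[str, float]:
    """Normalize per-genome coverage into percentages that sum to 100.

    Assumption for illustration: abundance is proportional to coverage.
    """
    total = sum(coverages.values())
    if total == 0:
        return {name: 0.0 for name in coverages}
    return {name: 100.0 * cov / total for name, cov in coverages.items()}


# Hypothetical coverages (average depth over each genome) in one sample.
sample = {"genome_A": 3.0, "genome_B": 1.0}
print(relative_abundances(sample))
```

Coverage is an absolute quantity for each genome, whereas abundance only describes each genome's share of the sample, so two samples with identical abundances can have very different coverages.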

What can sylph NOT do?

Sylph cannot:

  • Map reads. Unlike Kraken, sylph does not classify every read.
  • Find very low-abundance genomes. Sylph needs a minimum coverage of roughly 0.01-0.05x for bacterial genomes. In practice, each bacterial genome needs at least a few hundred short reads.
  • Reliably detect organisms that match the database only at genus level or higher (that is, with no species-level match). If your sample is not well characterized by the database, sylph may struggle. Note: this also applies to most profilers.
  • Compare genomes to genomes, or compare metagenomes to metagenomes.
  • Work with 16S data.
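
To put the minimum-coverage figure above in concrete terms, the following sketch converts a target coverage into an approximate read count using coverage = (number of reads × read length) / genome length. The 5 Mbp genome size and 150 bp read length are assumed values chosen for illustration, and `reads_needed` is a hypothetical helper, not part of sylph.

```python
import math


def reads_needed(coverage: float, genome_length: int, read_length: int) -> int:
    """Smallest number of reads that reaches the given average coverage."""
    return math.ceil(coverage * genome_length / read_length)


# Assumed example: a 5 Mbp bacterial genome sequenced with 150 bp reads.
for cov in (0.01, 0.05):
    print(cov, reads_needed(cov, 5_000_000, 150))
```

Under these assumptions, 0.01x coverage corresponds to 334 reads and 0.05x to 1667 reads, which is consistent with the "few hundred short reads" guideline.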

How does sylph work?

The figure below summarizes sylph's main steps.

  1. (Panel 1) Reads and reference genomes are broken into k-mers using the sylph sketch command. The k-mers are subsampled by a factor of c (default c = 200), so roughly 1 in every c k-mers is kept.
  2. (Panel 1) Using sylph query or sylph profile, the k-mers in each reference genome are checked against the k-mers in the reads.
  3. (Panel 2) Sylph uses statistics to estimate the containment ANI between each reference genome and the metagenomes.
  4. (Panel 3) sylph query: all genomes with high ANI (> 90% by default) from the previous step are reported. No abundances are calculated.
  5. (Panel 3) sylph profile: uses a k-mer remapping algorithm to report the genomes present at species level (ANI > 95%) and calculates their abundances.
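
The sketching and containment steps above can be illustrated with a minimal Python sketch. It is not sylph's implementation. The hash function, the helper names (`sketch`, `containment`, `containment_ani`), and the value k = 31 are assumptions made for illustration. It also ignores reverse complements and uses the naive point estimate ANI ≈ containment^(1/k) instead of sylph's statistical estimator, which is not detailed here.

```python
import hashlib

K = 31   # k-mer length (assumed for illustration)
C = 200  # subsampling factor: keep roughly 1 in every C k-mers


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq: str, k: int = K, c: int = C) -> set[int]:
    """Keep the hashes of k-mers that fall in the lowest 1/c of the hash range."""
    threshold = 2**64 // c
    kept = set()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            kept.add(h)
    return kept


def containment(genome_sketch: set[int], sample_sketch: set[int]) -> float:
    """Fraction of the genome's sampled k-mers that also occur in the sample."""
    if not genome_sketch:
        return 0.0
    return len(genome_sketch & sample_sketch) / len(genome_sketch)


def containment_ani(cont: float, k: int = K) -> float:
    """Naive ANI estimate: a k-mer survives only if all k bases match."""
    return cont ** (1.0 / k)


# Example: 20% of a genome's sampled k-mers are found in the reads.
print(f"{containment_ani(0.20):.4f}")
```

Because the same hash threshold is applied to reads and genomes, a k-mer that is kept in one sketch is kept in every sketch, which is what makes the subsampled sketches comparable. The naive estimate is biased downward when coverage is low, since many genome k-mers are then simply not sequenced; this is the situation the statistical step in Panel 2 is meant to handle.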

How do I use sylph?

See the tutorials and manuals outlined in the README. The cookbook may be especially useful.