Introduction: what is sylph and how does it work?
Jim Shaw edited this page Apr 25, 2024 · 7 revisions
Sylph is an extremely fast and memory-efficient program for profiling and searching metagenomic samples against databases. It is 10-100x faster than other popular software such as MetaPhlAn or Kraken, and more memory-efficient too.
Sylph can:
- Profile metagenomes: sylph can calculate the abundances of genomes in a sample using a reference database. This is the same type of output as Kraken or MetaPhlAn.
- Search genomes against metagenomes: sylph can check if a genome is contained in your sample (e.g. is this E. coli genome in my sample?).
- ANI querying: sylph can estimate the containment average nucleotide identity (ANI) of a reference genome to the genomes in your sample.
- Use custom reference databases: Eukaryotes, viruses, and any collections of fasta files are ok.
- Long reads are usable: sylph is primarily optimized for short reads, but it can utilize nanopore or PacBio reads with high precision.
- Calculate coverage: sylph can estimate the coverage (not just the abundance) of genomes in your database.
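To illustrate how coverage differs from abundance, here is a minimal sketch that converts per-genome coverage into two kinds of relative abundance. The function name, genome names, and numbers are hypothetical, and this is not sylph's implementation; it only assumes that taxonomic abundance is proportional to coverage and that sequence abundance is proportional to coverage times genome length.

```python
def relative_abundances(coverage, genome_length):
    """Convert per-genome coverage into two kinds of relative abundance (%).

    Hypothetical helper for illustration; not part of sylph.
    """
    # Taxonomic abundance: share of genome copies, proportional to coverage.
    total_cov = sum(coverage.values())
    taxonomic = {g: 100 * c / total_cov for g, c in coverage.items()}

    # Sequence abundance: share of sequenced bases,
    # proportional to coverage * genome length.
    bases = {g: coverage[g] * genome_length[g] for g in coverage}
    total_bases = sum(bases.values())
    sequence = {g: 100 * b / total_bases for g, b in bases.items()}
    return taxonomic, sequence


# Made-up example: two genomes at equal coverage but different sizes.
coverage = {"genome_A": 10.0, "genome_B": 10.0}
genome_length = {"genome_A": 2_000_000, "genome_B": 6_000_000}

taxonomic, sequence = relative_abundances(coverage, genome_length)
print("taxonomic:", taxonomic)
print("sequence: ", sequence)
```

Both genomes have the same coverage, so they split the taxonomic abundance evenly, while the larger genome accounts for three quarters of the sequenced bases. Coverage itself is an absolute quantity and does not depend on what else is in the sample.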
Sylph cannot:
- Map reads. Unlike Kraken, sylph does not classify every read.
- Find very low-abundance genomes. Sylph requires > 0.01-0.05x coverage at minimum for bacterial genomes. Each bacterial genome needs at least a few hundred short reads.
- Reliably find genomes at genus level or higher (if they are not present at species level). If your sample is not well-characterized by the database, sylph may struggle. Note: this also applies to most profilers.
- Compare genomes to genomes, or compare metagenomes to metagenomes.
- Work with 16S data.
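As a rough check on the coverage limit above, the number of reads behind a given coverage follows from coverage = (number of reads × read length) / genome length. The sketch below assumes a 5 Mbp bacterial genome and 150 bp reads; both values are illustrative assumptions, not sylph parameters.

```python
import math


def reads_for_coverage(coverage, genome_length, read_length):
    """Smallest number of reads whose total bases reach coverage * genome_length."""
    return math.ceil(coverage * genome_length / read_length)


genome_length = 5_000_000  # assumed bacterial genome size in bp
read_length = 150          # assumed short-read length in bp

for cov in (0.01, 0.05):
    n_reads = reads_for_coverage(cov, genome_length, read_length)
    print(f"{cov}x coverage needs about {n_reads} reads")
```

Under these assumptions, 0.01x corresponds to 334 reads and 0.05x to 1667 reads, which is consistent with the "few hundred short reads" rule of thumb.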
The figure below summarizes sylph's main steps.
![](https://private-user-images.githubusercontent.com/12787948/293108118-04ff3385-a060-443d-940b-665a2a80a1ca.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk3MDUxNjgsIm5iZiI6MTcxOTcwNDg2OCwicGF0aCI6Ii8xMjc4Nzk0OC8yOTMxMDgxMTgtMDRmZjMzODUtYTA2MC00NDNkLTk0MGItNjY1YTJhODBhMWNhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjI5VDIzNDc0OFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTFiZWYwOTdkOWZjMzkwZThjYjY4MDk1MWI1ZmNmMDBmMTAzMTBiMjY0MDQ2MDI5MDVlZDg5MGFhZWU2NjlhM2YmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.Uv_epE9iMBhPI-f3tsOm0CKFPP354Wdyj1XmRhMHikA)
- (Panel 1) Reads and reference genomes are broken into k-mers using the `sylph sketch` command. k-mers are downsampled by a factor of `c` (default = 200).
- (Panel 1) Using `sylph query` or `sylph profile`, the k-mers in each reference genome are checked against the k-mers in the reads.
- (Panel 2) Sylph uses statistics to estimate the containment ANI between each reference genome and the metagenomes.
- (Panel 3) `sylph query`: all genomes with high ANI (> 90% by default) from the previous step are reported. No abundances are calculated.
- (Panel 3) `sylph profile`: calculates abundances and reports the genomes present at species level (ANI > 95%) using a k-mer remapping algorithm.
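The first three steps can be illustrated with a toy sketch. It is not sylph's implementation: the hash-based subsampling (keeping k-mers whose hash is divisible by `c`), the choice of k = 31, and the naive estimate ANI ≈ containment^(1/k) are assumptions made for illustration. It also ignores reverse complements, sequencing errors, and the coverage-aware statistics sylph uses in Panel 2. All function names are hypothetical.

```python
import hashlib
import random


def kmers(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Stable 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, c):
    """Keep roughly 1 in c k-mers: those whose hash is divisible by c."""
    return {h for h in map(hash_kmer, kmers(seq, k)) if h % c == 0}


def containment_ani(genome_sketch, sample_sketch, k):
    """Naive ANI estimate from the fraction of genome k-mers found in the sample."""
    if not genome_sketch:
        raise ValueError("empty genome sketch")
    containment = len(genome_sketch & sample_sketch) / len(genome_sketch)
    return containment ** (1 / k)


def mutate(seq, rate, rng):
    """Substitute each base with probability `rate` by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)
K, C = 31, 200  # k = 31 is an assumption; c = 200 is the default stated above

# Random "reference genome" and a 2%-diverged copy standing in for the sample.
genome = "".join(rng.choice("ACGT") for _ in range(50_000))
sample = mutate(genome, 0.02, rng)

genome_sketch = sketch(genome, K, C)
ani_self = containment_ani(genome_sketch, genome_sketch, K)
ani_sample = containment_ani(genome_sketch, sketch(sample, K, C), K)

print(f"sketch size: {len(genome_sketch)} k-mers")
print(f"ANI vs. itself: {ani_self:.3f}")
print(f"ANI vs. 2%-diverged sample: {ani_sample:.3f}")
```

Because the same hash rule is applied to both the genome and the sample, the retained k-mers are comparable across sketches, so containment can be estimated from a small fraction of all k-mers. In a real metagenome at low coverage, many genome k-mers are missing simply because they were not sequenced, which is why a naive estimate like this one would be biased downward without a coverage correction.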
See the tutorials and manuals outlined in the README. The cookbook may especially be of use.