sylph-utils: utility scripts and taxonomy for sylph
This repository contains scripts for incorporating taxonomic information into the output of sylph.
The following databases are currently supported (with pre-built sylph databases available here). Their taxonomy metadata files can be found in the prokaryote, fungi, and virus subfolders of this repository.
- GTDB-R220 (April 2024): prokaryote/gtdb_r220_metadata.tsv.gz
- GTDB-R214 (April 2023): prokaryote/gtdb_r214_metadata.tsv.gz
- OceanDNA: prokaryote/ocean_dna_metadata.tsv.gz
- SMAG: prokaryote/smag_metadata.tsv.gz
- RefSeq fungi representative genomes: fungi/fungi_refseq_2024-07-25_metadata.tsv.gz
- IMG/VR 4.1 high-confidence viral OTU genomes: virus/IMGVR_4.1_metadata.tsv.gz
Tip
Viral genomes, viral-host annotations, and fungal taxonomy information have been available since 2024-07-25 (v0.2 of sylph_to_taxprof.py).
- Python3
- Pandas
Run pip install pandas if pandas is not installed.
python sylph_to_taxprof.py -m database_metadata.tsv.gz -s sylph_output.tsv -o prefix_or_folder/
- -m: taxonomy metadata file(s). Metadata files are in tsv.gz format and are provided in this repository. Multiple taxonomy metadata files can be given (they will be concatenated).
- -s: the output from sylph. The databases used to run sylph must match the metadata files given to the -m option.
- -o: prefix prepended to all output files; one output file is written for each sample in the sylph output.
- --annotate-virus-hosts (new in v0.2): annotates detected viral genomes with host metadata (currently only available for IMG/VR 4.1).
- The output suffix is .sylphmpa.
Use the metadata file corresponding to the database used. For example, if you use the GTDB-R220 database for sylph, you must use the gtdb_r220_metadata.tsv.gz file.
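The pairing between database and metadata file can be made explicit in a small helper. The following is only a sketch: the short database labels and the function name build_taxprof_command are hypothetical names chosen for illustration, while the metadata paths and the -m, -s, and -o options are the ones listed in this README. It builds the command line without running it.

```python
import shlex

# Metadata files listed in this README, keyed by a short database label.
# The labels are hypothetical names chosen for this sketch.
METADATA_FOR_DATABASE = {
    "gtdb-r220": "prokaryote/gtdb_r220_metadata.tsv.gz",
    "gtdb-r214": "prokaryote/gtdb_r214_metadata.tsv.gz",
    "oceandna": "prokaryote/ocean_dna_metadata.tsv.gz",
    "smag": "prokaryote/smag_metadata.tsv.gz",
    "refseq-fungi": "fungi/fungi_refseq_2024-07-25_metadata.tsv.gz",
    "imgvr-4.1": "virus/IMGVR_4.1_metadata.tsv.gz",
}


def build_taxprof_command(database, sylph_output, out_prefix):
    """Return the argument list for sylph_to_taxprof.py with the matching metadata."""
    try:
        metadata = METADATA_FOR_DATABASE[database]
    except KeyError:
        raise ValueError(f"no metadata file known for database {database!r}") from None
    return [
        "python", "sylph_to_taxprof.py",
        "-m", metadata,
        "-s", sylph_output,
        "-o", out_prefix,
    ]


if __name__ == "__main__":
    cmd = build_taxprof_command("gtdb-r220", "sylph_output.tsv", "prefix_or_folder/")
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Failing early on an unknown label avoids silently pairing a sylph run with metadata from a different database.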
See here for more information on
- taxonomy metadata files definitions
- the output format
- how to create taxonomy metadata for customized genome databases
Tip
In Python/pandas, you can read the output with pd.read_csv('output.sylphmpa', sep='\t', comment='#').
2024-10-03 - version 0.3
- The script now fails if it detects ambiguous sample names
- Added the -f or --add-folder-information flag to include directory information in the output .sylphmpa files; this can disambiguate sample names.
2024-07-25 - version 0.2
- Added IMG/VR 4.1 taxonomy and fungi refseq taxonomy.
- Added coverage information to the .sylphmpa files.
- Added the ability to do viral-host annotation if the IMG/VR 4.1 metadata is used.
Merge multiple taxonomic profiles from sylph_to_taxprof.py into a TSV table
python merge_sylph_taxprof.py *.sylphmpa --column {ANI, relative_abundance, sequence_abundance} -o output_table.tsv
- *.sylphmpa: files from sylph_to_taxprof.py.
- --column: one of ANI, relative_abundance, or sequence_abundance (see the paper for the difference between the two abundances).
- -o: output file in TSV format.
clade_name sample1.fastq.gz sample2.fastq.gz
d__Archaea 0.0 1.1
d__Archaea|p__Methanobacteriota 0.0 0.0965
...
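A merged table in this shape can be loaded with the Python standard library alone. The sketch below assumes the file is tab-separated with a clade_name column followed by one numeric column per sample, as in the excerpt above; the function names are hypothetical, and the inline EXAMPLE_TABLE stands in for a real output_table.tsv. It also shows one way to keep only the clades at a given rank by looking at the prefix of the last level in the clade name.

```python
import csv
import io

# Excerpt of the merged table shown above, tab-separated.
EXAMPLE_TABLE = (
    "clade_name\tsample1.fastq.gz\tsample2.fastq.gz\n"
    "d__Archaea\t0.0\t1.1\n"
    "d__Archaea|p__Methanobacteriota\t0.0\t0.0965\n"
)


def read_merged_table(handle):
    """Map each clade name to a {sample: value} dict (values assumed numeric)."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        clade = row.pop("clade_name")
        table[clade] = {sample: float(value) for sample, value in row.items()}
    return table


def clades_at_rank(table, rank_prefix):
    """Keep clades whose deepest level starts with rank_prefix, e.g. 'p__'."""
    return {
        clade: values
        for clade, values in table.items()
        if clade.split("|")[-1].startswith(rank_prefix)
    }


if __name__ == "__main__":
    table = read_merged_table(io.StringIO(EXAMPLE_TABLE))
    for clade, values in clades_at_rank(table, "p__").items():
        print(clade, values)
```

For a real file, replace the io.StringIO object with open('output_table.tsv', newline='').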