bluenote-1577/sylph-utils

sylph-utils: utility scripts and taxonomy for sylph

This repository contains scripts for incorporating taxonomic information into the output of sylph.

Taxonomy integration - available databases

The following databases are currently supported, with pre-built sylph databases available here. Their taxonomy metadata files are in the prokaryote, fungi, and virus subfolders of this repository.

  1. GTDB-R220 (April 2024) - prokaryote/gtdb_r220_metadata.tsv.gz
  2. GTDB-R214 (April 2023) - prokaryote/gtdb_r214_metadata.tsv.gz
  3. OceanDNA - prokaryote/ocean_dna_metadata.tsv.gz
  4. SMAG - prokaryote/smag_metadata.tsv.gz
  5. Refseq fungi representative genomes - fungi/fungi_refseq_2024-07-25_metadata.tsv.gz
  6. IMG/VR 4.1 high-confidence viral OTU genomes - virus/IMGVR_4.1_metadata.tsv.gz

Tip

Viral genomes, viral-host annotations, and fungi taxonomy information have been available since 2024-07-25 (v0.2 of sylph_to_taxprof.py).

Requirements/Install

  • Python3
  • Pandas

Run pip install pandas if pandas is not installed.

sylph_to_taxprof.py - obtaining taxonomic profiles from sylph's output

python sylph_to_taxprof.py -m database_metadata.tsv.gz -s sylph_output.tsv -o prefix_or_folder/
  • -m: taxonomy metadata file(s). Metadata files are of the form tsv.gz and present in this repository. Multiple taxonomy metadata files can be input (will be concatenated)
  • -s: the output from sylph. The databases used to run sylph must match the metadata files given to -m.
  • -o: prepends this prefix to all of the output files; one output for each sample in the sylph output.
  • --annotate-virus-hosts (new in v0.2): annotates found viral genomes with host information metadata (only available for IMG/VR 4.1 right now)
  • Output suffix is .sylphmpa.

Use the metadata file corresponding to the database used. E.g. if you use the GTDB-R220 database for sylph, you must use the gtdb_r220_metadata.tsv.gz file.
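As a rough illustration of how the options above combine, the sketch below assembles the command line in Python. The helper name build_taxprof_command is hypothetical and not part of this repository. It also assumes that multiple metadata files are passed space-separated after a single -m, which the description above suggests but does not spell out.

```python
import subprocess  # only needed if you uncomment the run() call below


def build_taxprof_command(metadata_files, sylph_output, out_prefix,
                          annotate_virus_hosts=False,
                          script="sylph_to_taxprof.py"):
    """Assemble an argument list for sylph_to_taxprof.py.

    Hypothetical helper for illustration. Assumes several metadata
    files can follow one -m flag, separated by spaces.
    """
    if not metadata_files:
        raise ValueError("at least one metadata file is required")
    cmd = ["python", script, "-m", *[str(m) for m in metadata_files],
           "-s", str(sylph_output), "-o", str(out_prefix)]
    if annotate_virus_hosts:
        # Only meaningful when the IMG/VR 4.1 metadata is included.
        cmd.append("--annotate-virus-hosts")
    return cmd


cmd = build_taxprof_command(
    ["prokaryote/gtdb_r220_metadata.tsv.gz", "virus/IMGVR_4.1_metadata.tsv.gz"],
    "sylph_output.tsv",
    "prefix_or_folder/",
    annotate_virus_hosts=True,
)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run the script
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.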

See here for more information on

  1. taxonomy metadata files definitions
  2. the output format
  3. how to create taxonomy metadata for customized genome databases

Tip

In python/pandas, you can read the output with pd.read_csv('output.sylphmpa',sep='\t', comment='#').
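A slightly fuller sketch of the tip above follows. It uses a small in-memory stand-in for a .sylphmpa file, because the exact header line and column set are not shown here. The "#SampleID" comment line and the column names clade_name and relative_abundance are assumptions made for illustration, so check a real output file for the actual layout.

```python
import io

import pandas as pd


def read_sylphmpa(path_or_buffer):
    """Read a .sylphmpa profile as a DataFrame, skipping '#' comment lines."""
    return pd.read_csv(path_or_buffer, sep="\t", comment="#")


# Synthetic stand-in for a real file; header and columns are assumed.
fake_profile = (
    "#SampleID\tsample1.fastq.gz\n"
    "clade_name\trelative_abundance\n"
    "d__Archaea\t1.1\n"
)

profile = read_sylphmpa(io.StringIO(fake_profile))
print(profile)
```

With a real file, call read_sylphmpa('output.sylphmpa') instead of passing a buffer.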

CHANGELOG

2024-10-03 - version 0.3

  • The script now fails if it detects ambiguous sample names
  • Added the -f or --add-folder-information flag to include directory information in the output .sylphmpa files; this can disambiguate sample names.

2024-07-25 - version 0.2

  • Added IMG/VR 4.1 taxonomy and fungi refseq taxonomy.
  • Added coverage information into the .sylphmpa files
  • Added capabilities to do viral-host annotation if the IMG/VR 4.1 metadata is used.

merge_sylph_taxprof.py - merge multiple taxonomic profiles

Merge multiple taxonomic profiles from sylph_to_taxprof.py into a TSV table

python merge_sylph_taxprof.py *.sylphmpa --column {ANI, relative_abundance, sequence_abundance} -o output_table.tsv
  • *.sylphmpa files from sylph_to_taxprof.py.
  • --column can be ANI, relative_abundance, or sequence_abundance (see the paper for the difference between the two abundances)
  • -o output file in TSV format.

Output format

clade_name                        sample1.fastq.gz  sample2.fastq.gz
d__Archaea                        0.0               1.1
d__Archaea|p__Methanobacteriota   0.0               0.0965
...
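One possible way to work with the merged table is to keep only the rows at a single taxonomic rank. The sketch below assumes the table is tab-separated with a clade_name column as shown above, and that the deepest entry in each clade string carries a rank prefix such as d__ or p__. The function rows_at_rank is a hypothetical helper, not something provided by this repository.

```python
import io

import pandas as pd


def rows_at_rank(table, rank_prefix):
    """Keep rows whose deepest clade starts with rank_prefix (e.g. 'p__')."""
    # rsplit on the literal '|' and take the last piece of each clade string.
    deepest = table["clade_name"].str.rsplit("|", n=1).str[-1]
    return table[deepest.str.startswith(rank_prefix)]


# The two example rows from the output format above, tab-separated.
example = (
    "clade_name\tsample1.fastq.gz\tsample2.fastq.gz\n"
    "d__Archaea\t0.0\t1.1\n"
    "d__Archaea|p__Methanobacteriota\t0.0\t0.0965\n"
)

merged = pd.read_csv(io.StringIO(example), sep="\t")
phyla = rows_at_rank(merged, "p__")
print(phyla)
```

For a real table, replace the in-memory buffer with the path given to -o, for example pd.read_csv('output_table.tsv', sep='\t').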
