Skip to content

Taxonomic profiling with the GTDB‐R214 database

Jim Shaw edited this page Apr 25, 2024 · 14 revisions

In this tutorial, we will use sylph to do

  1. Multi-sample prokaryotic metagenomic profiling with the GTDB-R214 database
  2. output a taxonomic output like MetaPhlAn.

Requirements:

  1. sylph is installed.
  2. Python3 is installed with pandas

UPDATE (24th April, 2024) - GTDB r220 database (113,104 species representative genomes) released!

This tutorial uses the GTDB-R214 database, an older database. A new GTDB version (R220) is now out, with 35% more genomes.

Refer to the pre-built databases and updated metadata files to swap out the R214 database with R220 during this tutorial.


1. Download the GTDB-R214 database.

Option 1. Download pre-sketched database

Download the pre-sketched (i.e., indexed) GTDB-R214 database provided here. For example,

wget https://storage.googleapis.com/sylph-stuff/v0.3-c1000-gtdb-r214.syldb -O gtdb_database.syldb

# OR

#wget https://storage.googleapis.com/sylph-stuff/v0.3-c200-gtdb-r214.syldb -O gtdb_database.syldb
  • The c1000 database is smaller and faster, but less sensitive for low-abundance genomes compared to the c200 database.

Option 2. Create your own index

Assuming you have the GTDB-R214 genome database present in a folder called gtdb_genomes_reps_r214/, do the following:

  1. Compile all GTDB genomes into a list file: find gtdb_genomes_reps_r214 | grep .fna > gtdb_all.txt
  2. Create a database with: sylph sketch -l gtdb_all.txt -t 50 -o gtdb_database

2. Download and index mouse-gut metagenome

Let's download and index a mouse gut metagenomes:

wget https://storage.googleapis.com/sylph-stuff/mouse_1.fq.gz
wget https://storage.googleapis.com/sylph-stuff/B-mouse_1.fq.gz

wget https://storage.googleapis.com/sylph-stuff/mouse_2.fq.gz
wget https://storage.googleapis.com/sylph-stuff/B-mouse_2.fq.gz

We now have two sets of paired-reads. For paired-end reads, sylph must sketch the reads first as follows:

sylph sketch -1 *mouse_1.fq.gz -2 *mouse_2.fq.gz 

UPDATE (sylph version >= v0.6)

Since sylph v0.6, direct profiling of paired-end reads without sketching is possible:

sylph profile -1 *_1.fq.gz -2 *_2.fq.gz gtdb_database.syldb -t 10 -o results.tsv

sylph can sketch multiple samples at once. This allows for multi-threaded sketching and is more efficient.

This outputs two files called mouse_1.fq.gz.paired.sylsp and B-mouse_1.fq.gz.paired.sylsp. The paired indicates the sketch was created with the -1,-2 options, and the sylsp indicates that it is indexed.

WARNING: Do not interleave paired-end reads. sylph will give more accurate results if you do not concatenate/interleave your reads.

3. Metagenomic profiling with sylph

To multi-sample profile with sylph, run the following:

sylph profile *mouse_1.fq.gz.paired.sylsp gtdb_database.syldb -t 10 -o results.tsv

This uses 10 threads to profile the metagenome against our database into a file called results.tsv.

Inspecting the results.tsv file gives:

head results.tsv 
Sample_file	Genome_file	Taxonomic_abundance	Sequence_abundance	Adjusted_ANI	Eff_cov	ANI_5-95_percentile	Eff_lambda	Lambda_5-95_percentile	Median_cov	Mean_cov_geq1	Containment_ind	Naive_ANI	Contig_name
mouse_1.fq	gtdb_genomes_reps_r214/database/GCA/910/577/315/GCA_910577315.1_genomic.fna.gz	10.6258	10.1684	97.66	0.656	97.33-98.07	0.656	0.55-0.76	1	1.456	616/2624	95.43	CAJTME010000001.1 TPA_asm: uncultured Muribaculaceae bacterium isolate MGBC104416 genome assembly, contig: MGBC104416.1, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCF/001/945/605/GCF_001945605.1_genomic.fna.gz	9.1588	7.5114	98.36	0.565	97.92-98.84	0.565	0.47-0.66	1	1.421	553/2100	95.79	NZ_MPKA01000006.1 Dubosiella newyorkensis strain NYU-BL-A4 NODE_100_length_1023_cov_1250.21_ID_199, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCA/910/589/675/GCA_910589675.1_genomic.fna.gz	8.3282	5.1014	95.18	0.514	94.47-95.99	0.514	0.37-0.67	1	1.299	144/1629	92.47	CAJUTO010000001.1 uncultured Lactobacillus sp. isolate MGBC166701 genome assembly, contig: MGBC166701.1, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCA/910/588/855/GCA_910588855.1_genomic.fna.gz	8.1294	11.9822	98.20	0.502	97.87-98.59	0.502	0.43-0.56	1	1.345	892/3903	95.35	CAJUSR010000001.1 TPA_asm: uncultured Kineothrix sp. isolate MGBC162921 genome assembly, contig: MGBC162921.1, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCA/910/579/675/GCA_910579675.1_genomic.fna.gz	7.7809	7.5080	98.71	0.480	98.27-99.11	0.480	0.40-0.55	1	1.303	654/2522	95.74	CAJTUF010000001.1 TPA_asm: uncultured Muribaculaceae bacterium isolate MGBC114255 genome assembly, contig: MGBC114255.1, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCF/001/591/705/GCF_001591705.1_genomic.fna.gz	7.1538	6.0999	96.41	0.441	95.86-97.20	0.441	0.32-0.54	1	1.216	264/2258	93.31	NZ_BCVK01000001.1 Lactococcus lactis subsp. cremoris NBRC 100676, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCF/000/403/395/GCF_000403395.2_genomic.fna.gz	6.9472	9.1490	95.75	0.429	95.27-96.45	0.429	0.32-0.51	1	1.251	299/3240	92.60	NZ_KE159657.1 Anaerotruncus sp. G3(2012) strain G3 acPFl-supercont1.1, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCA/003/833/075/GCA_003833075.1_genomic.fna.gz	5.7036	7.2196	97.02	0.352	96.46-97.75	0.352	0.27-0.42	1	1.234	394/3324	93.35	RIAY01000001.1 Muribaculaceae bacterium Isolate-036 (Harlan) seq1, whole genome shotgun sequence
mouse_1.fq	gtdb_genomes_reps_r214/database/GCA/910/587/685/GCA_910587685.1_genomic.fna.gz	5.0394	3.2464	97.82	0.311	97.06-98.58	0.311	0.24-0.39	1	1.200	230/1672	93.80	CAJUMY010000001.1 TPA_asm: uncultured Christensenellaceae bacterium isolate MGBC161649 genome assembly, contig: MGBC161649.1, whole genome shotgun sequence

You can check https://github.com/bluenote-1577/sylph/wiki/Output-format for the definitions of each of the columns. Various metrics, including (effective) coverage estimates, ANIs, abundances, etc are output.

4. Get a taxonomic profile

Sylph's TSV output has no taxonomic information, only genome information. To get a taxonomic profile with ranks, we integrate GTDB's taxonomy with sylph's output. See here for information on how this works and how to customize taxonomy/databases.

Run the following commands:

git clone https://github.com/bluenote-1577/sylph-utils
python sylph-utils/sylph_to_taxprof.py -s results.tsv -m  sylph-utils/gtdb_r214_metadata.tsv.gz -o prefix_
ls prefix_mouse_1.fq.gz.sylphmpa
ls prefix_B-mouse_2.fq.gz.sylphmpa

The python script takes in a sylph result (-s), a taxonomy metadata file (-m).

IMPORTANT: -m's metadata file must correspond to database used. If you use GTDB-R220 or R214, you must use the correct R220 or R214 metadata file.

The script outputs a new file called prefix_MYSAMPLENAME.sylphmpa for each sample in the results file. Investigating one file gives the following:

head -n 20 prefix_mouse_1.fq.sylphmpa                                                                                       
#SampleID	mouse_1.fq
clade_name	relative_abundance	sequence_abundance	ANI (if strain-level)
d__Bacteria	100.00010000000002	99.99999999999999	NA
d__Bacteria|p__Bacillota	24.640800000000002	18.712699999999998	NA
d__Bacteria|p__Bacillota_A	47.333499999999994	52.5969	NA
d__Bacteria|p__Bacillota_A|c__Clostridia	47.333499999999994	52.5969	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales	9.0019	6.252800000000001	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae	3.9625	3.0064	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter	3.9625	3.0064	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter|s__Pumilibacter	3.9625	3.0064	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__Pumilibacteraceae|g__Pumilibacter|s__Pumilibacter|t__GCA_910587595.1	3.9625	3.0064	95.45
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700	5.0394	3.2464	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649	5.0394	3.2464	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649|s__MGBC161649	5.0394	3.2464	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Christensenellales|f__UBA3700|g__MGBC161649|s__MGBC161649|t__GCA_910587685.1	5.0394	3.2464	97.82
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales	20.3153	24.426900000000003	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae	20.3153	24.426900000000003	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69	4.1703	4.1626	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69|s__1XD42-69	4.1703	4.1626	NA
d__Bacteria|p__Bacillota_A|c__Clostridia|o__Lachnospirales|f__Lachnospiraceae|g__1XD42-69|s__1XD42-69|t__GCA_910589105.1	4.1703	4.1626	97.47

This file is a taxonomic profile similar to what MetaPhlAn outputs. Each taxonomic rank has an associated taxonomic or sequence abundance. Taxonomic abundance normalizes by genome sizes, whereas sequence abundance is the % of reads assigned to a taxonomic rank.

Note that the t__GCA_... ranks, which are strain/genome-level ranks, have an associated ANI output. The other ranks do not.

5. Conclusion

We have just shown how sylph can do multi-sample profiling against the GTDB-R214 database very efficiently while giving ANI information as well.

To do taxonomic profiling with other databases, see https://github.com/bluenote-1577/sylph/wiki/Integrating-taxonomic-information-with-sylph. Alternatively, consider just analyzing sylph's resulting TSV file; it contains some useful information not present in the taxonomic profile.