Skip to content

Tutorial: setting up the GTDB genome database to search against

Jim Shaw edited this page Jul 14, 2024 · 16 revisions

The following tutorial shows how to search genomes with skani against a large database (> 85,000 genomes) in seconds.


Update 2024-03-31: GTDB-R214 pre-sketched databases

You can download a preprocessed version of the GTDB-R214 database for searching: the .tar.gz file is available here.

wget https://storage.googleapis.com/skani_files/skani-gtdb-r214-sketch-v0.2.tar.gz 
tar -zxvf skani-gtdb-r214-sketch-v0.2.tar.gz

skani search my_genome.fa -d skani-gtdb-r214-sketch-v0.2 -o results.tsv

The below tutorial will be faster if you already have the GTDB database available.


Note

GTDB-R220 is a new version of the GTDB database that supersedes GTDB-R214. This tutorial still uses GTDB-R214, but you can replace it with GTDB-R220 if desired.

Requirements

  • skani is installed and in PATH (i.e. typing skani -h works). See skani's README for how to install skani.
  • A decent internet connection; we'll be downloading a database of > 70 GB.
  • Around ~120 GB total disk space in total for the database and indexing.

What we will do

  • set up the GTDB representative database, a large database with > 85,000 genomes, from scratch
  • show how skani can search genomes/MAGs against the gtdb database in seconds

1. Download the GTDB database and unzip it.

The database is > 70 GB compressed, so this may take a while to download and unzip.

# version R214
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_reps/gtdb_genomes_reps_r214.tar.gz
tar -xf gtdb_genomes_reps_r214.tar.gz

# new version if desired
# wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/genomic_files_reps/gtdb_genomes_reps_r220.tar.gz
# tar -xf gtdb_genomes_reps_r220.tar.gz

2. Preprocess the files so skani can index the database.

The gtdb database is formatted in a special way. In order to process the reference genome files inside the gtdb folder, we have to do a bit of work. We can run the following to collect all genomes locations into a file called gtdb_file_names.txt.

find gtdb_genomes_reps_r214/ | grep .fna > gtdb_file_names.txt

The above command puts all reference genomes into a file, with each line containing one reference genome file.

3. Use skani sketch to index the database.

To construct the database called gtdb_skani_database_ani, use

skani sketch -l gtdb_file_names.txt -o gtdb_skani_database_ani -t 20

where I am using -t 20 for 20 threads, but feel free to use less if your workstation is limited.

4. Use skani search for low-memory and ultra-fast search

The database is now available in the folder gtdb_skani_database_ani. Let's try searching against it.

First, we'll download an E. coli K12 genome

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

Now we can simply search against the database using

skani search GCF_000005845.2_ASM584v2_genomic.fna.gz -d gtdb_skani_database_ani

and you should see the following output (numbers differ slightly depending on the version of skani used or version of gtdb)

Ref_file	Query_file	ANI	Align_fraction_query	Align_fraction_reference	Ref_name	Query_name
gtdb_genomes_reps_r214/database/GCF/003/697/165/GCF_003697165.2_genomic.fna.gz	GCF_000005845.2_ASM584v2_genomic.fna.gz	96.92	77.63	84.20	NZ_CP033092.2 Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775 chromosome, complete genome	NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
...

The output is sorted by the closest ANI reference to the query, so the closest E. coli genome in the database is given by the first line. Each column of the output gives some information. The name of the reference organism is given by the second last column, which is E. coli DSM 30083. The columns

ANI	Align_fraction_query	Align_fraction_reference
96.92	77.63	84.20

tell us that the ANI between these two genomes are 96.92%, the query E. coli had 77.63% of its genome covered by alignments, the reference genome had 84.20% of its genome covered by alignments. The exact numbers may differ depending on which version of skani is used.

This only took ~6 seconds for my system and < 6 GB of memory:

Command being timed: "skani search GCF_000005845.2_ASM584v2_genomic.fna.gz -d gtdb_skani_database_ani"
	User time (seconds): 6.98
	System time (seconds): 1.70
	Percent of CPU this job got: 136%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.33
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 5894392

and that's it! See Tutorial: classifying entire assemblies (MAGs or contigs) against 85,000 genomes in under 2 minutes for a second tutorial on how to use skani for classifying MAGs and contigs from assemblies.