---
layout: post  
title: Selecting Genomes by Taxonomy  
date: 2019-08-05  
author: Cameron Prybol  

---

In the [previous post](/downloading-refseq-genomes.html) I showed how to download reference genomes from the RefSeq database. The next step is to layer taxonomy information onto the genome information. This will allow us to survey the taxonomic breakdown of the organisms represented within the RefSeq reference genomes. It will also allow us to randomly sample genomes of interest uniformly from each branch of the taxonomy tree. Specifically, I would like to sample uniformly across the kingdoms of life (e.g. Animalia, Plantae, Fungi, Protista, Archaea, Bacteria, and Viruses). Different authorities recognize different kingdoms of life, so your preferred breakdown may differ!

We first need reference taxonomy information from NCBI, available [here](ftp://ftp.ncbi.nih.gov/pub/taxonomy/). The associated [README](ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt) explains that the taxonomy information is available as a compressed archive (in .zip, .tar.gz, and .tar.Z formats) and describes what each file within the archive contains. I'll download the .tar.gz archive and decompress it.

In [1]:
# Create the target directory (and any missing parents); a no-op if it already exists
mkpath("../datasets/taxonomy")
if !isfile("../datasets/taxonomy/taxdump.tar.gz")
    download("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz", "../datasets/taxonomy/taxdump.tar.gz")
    run(`tar -xzf ../datasets/taxonomy/taxdump.tar.gz -C ../datasets/taxonomy`)
end

Next, we can import the [uCSV](https://github.com/cjprybol/uCSV.jl) library to read in the relevant delimited text files and [DataFrames](https://github.com/JuliaData/DataFrames.jl) to handle the datasets in memory.

In [2]:
using uCSV
using DataFrames

Now we can import the RefSeq genome table and subset it using the same selection criteria as [previously](/downloading-refseq-genomes.html).

In [3]:
refseq_metadata = DataFrame(uCSV.read("../datasets/assembly_summary_refseq.txt", delim="\t", header=2))
refseq_metadata = 
    refseq_metadata[
        (refseq_metadata[:version_status] .== "latest") .&
        (refseq_metadata[:genome_rep] .== "Full") .&
        (refseq_metadata[:refseq_category] .== "reference genome")
        , :]

Row,# assembly_accession,bioproject,biosample,wgs_master,refseq_category,taxid,species_taxid,organism_name,infraspecific_name,isolate,version_status,assembly_level,release_type,genome_rep,seq_rel_date,asm_name,submitter,gbrs_paired_asm,paired_asm_comp,ftp_path,excluded_from_refseq,relation_to_type_material
,String,String,String,String,String,Int64,Int64,String,String,String,String,String,String,String,String,String,String,String,String,String,String,String
1,GCF_000001215.4,PRJNA164,SAMN02803731,,reference genome,7227,7227,Drosophila melanogaster,,,latest,Chromosome,Major,Full,2014/08/01,Release 6 plus ISO1 MT,The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics,GCA_000001215.4,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT,,
2,GCF_000001405.39,PRJNA168,,,reference genome,9606,9606,Homo sapiens,,,latest,Chromosome,Patch,Full,2019/02/28,GRCh38.p13,Genome Reference Consortium,GCA_000001405.28,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13,,
3,GCF_000001635.26,PRJNA169,,,reference genome,10090,10090,Mus musculus,,,latest,Chromosome,Patch,Full,2017/09/15,GRCm38.p6,Genome Reference Consortium,GCA_000001635.8,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6,,
4,GCF_000001735.4,PRJNA116,SAMN03081427,,reference genome,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,latest,Chromosome,Minor,Full,2018/03/15,TAIR10.1,The Arabidopsis Information Resource (TAIR),GCA_000001735.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1,,
5,GCF_000002035.6,PRJNA13922,SAMN06930106,,reference genome,7955,7955,Danio rerio,,,latest,Chromosome,Major,Full,2017/05/09,GRCz11,Genome Reference Consortium,GCA_000002035.4,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.6_GRCz11,,
6,GCF_000002985.6,PRJNA158,SAMEA3138177,,reference genome,6239,6239,Caenorhabditis elegans,strain=Bristol N2,,latest,Complete Genome,Major,Full,2013/02/07,WBcel235,C. elegans Sequencing Consortium,GCA_000002985.3,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235,,
7,GCF_000005845.2,PRJNA57779,SAMN02604091,,reference genome,511145,562,Escherichia coli str. K-12 substr. MG1655,strain=K-12 substr. MG1655,,latest,Complete Genome,Major,Full,2013/09/26,ASM584v2,Univ. Wisconsin,GCA_000005845.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2,,
8,GCF_000006745.1,PRJNA57623,SAMN02603969,,reference genome,243277,666,Vibrio cholerae O1 biovar El Tor str. N16961,strain=N16961,,latest,Complete Genome,Major,Full,2001/01/09,ASM674v1,TIGR,GCA_000006745.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/745/GCF_000006745.1_ASM674v1,,
9,GCF_000006765.1,PRJNA57945,SAMN02603714,,reference genome,208964,287,Pseudomonas aeruginosa PAO1,strain=PAO1,,latest,Complete Genome,Major,Full,2006/07/07,ASM676v1,PathoGenesis Corporation,GCA_000006765.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1,,
10,GCF_000006785.2,PRJNA57845,SAMN02604089,,reference genome,160490,1314,Streptococcus pyogenes M1 GAS,strain=SF370,,latest,Complete Genome,Major,Full,2014/04/01,ASM678v2,Univ. Oklahoma,GCA_000006785.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/785/GCF_000006785.2_ASM678v2,,
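
The row filter above combines three element-wise comparisons into a single Boolean mask. A minimal sketch of the same idea on plain vectors, using made-up toy values rather than the real RefSeq table:

```julia
# Toy columns (hypothetical values, not taken from the RefSeq table)
version_status  = ["latest", "latest", "replaced", "latest"]
genome_rep      = ["Full", "Partial", "Full", "Full"]
refseq_category = ["reference genome", "reference genome", "na", "representative genome"]

# Each comparison broadcasts over its column; `.&` combines the results row by row
mask = (version_status .== "latest") .&
       (genome_rep .== "Full") .&
       (refseq_category .== "reference genome")

# Indices of the rows that satisfy all three conditions
kept = findall(mask)
println("rows kept: ", kept)
```

Only the first toy row passes all three tests, so it is the only index kept.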


In the taxonomy [README](ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt) we can see that the file `nodes.dmp` contains a table mapping the taxonomy id (`tax_id` in `nodes.dmp`, and `species_taxid` in `assembly_summary_refseq.txt`) to taxonomic rank and taxonomic division (among other fields that we'll ignore for now), so let's read that table in next.

In [4]:
columns = first(uCSV.read("../datasets/taxonomy/nodes.dmp", delim="|", trimwhitespace=true))[[1, 3, 5]]
column_names = [:tax_id, :rank, :division_id]
nodes = DataFrame(columns, column_names)

Row,tax_id,rank,division_id
,Int64,String,Int64
1,1,no rank,8
2,2,superkingdom,0
3,6,genus,0
4,7,species,0
5,9,species,0
6,10,genus,0
7,11,species,0
8,13,genus,0
9,14,species,0
10,16,genus,0
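
According to the taxdump README, fields in the `.dmp` files are separated by `\t|\t` and each row ends with `\t|`, which is why the call above splits on `|` and trims whitespace. A rough sketch of that parsing in plain Julia, where `parse_dmp_line` is a hypothetical helper and the sample record is hand-written and shortened to its first five fields:

```julia
# Split one raw taxdump record into trimmed fields (hypothetical helper).
function parse_dmp_line(line::AbstractString)
    # Remove the trailing row terminator "\t|" (and any newline) before splitting
    body = replace(rstrip(line), r"\t\|$" => "")
    return String.(strip.(split(body, "|")))
end

# Hand-written illustrative record: tax_id, parent tax_id, rank, EMBL code, division id
line = "2\t|\t131567\t|\tsuperkingdom\t|\t\t|\t0\t|"

fields = parse_dmp_line(line)
tax_id = parse(Int, fields[1])
rank = fields[3]
division_id = parse(Int, fields[5])
println((tax_id = tax_id, rank = rank, division_id = division_id))
```

Columns 1, 3, and 5 are the same positions selected from `nodes.dmp` above.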


The nodes table does not include species names. We don't strictly need them because they are available in the RefSeq table. However, it would be nice to have a second field of data to verify that the taxonomic ids from RefSeq and the taxonomy database refer to the same species. Thus, we will read in the names table and use it to verify species accuracy later. Note that the names table contains multiple rows for each taxonomic id (synonyms used by different authorities and references to the publications identifying/reclassifying the species), so we will drop the extras by keeping only the rows whose name class is `"scientific name"`.

In [5]:
columns = first(uCSV.read("../datasets/taxonomy/names.dmp", delim='|', trimwhitespace=true))[[1, 2, 4]]
column_names = [:tax_id, :name, :name_class]
names = DataFrame(columns, column_names)
names = names[names[:name_class] .== "scientific name", :]

Row,tax_id,name,name_class
,Int64,String,String
1,1,root,scientific name
2,2,Bacteria,scientific name
3,6,Azorhizobium,scientific name
4,7,Azorhizobium caulinodans,scientific name
5,9,Buchnera aphidicola,scientific name
6,10,Cellvibrio,scientific name
7,11,Cellulomonas gilvus,scientific name
8,13,Dictyoglomus,scientific name
9,14,Dictyoglomus thermophilum,scientific name
10,16,Methylophilus,scientific name
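
Filtering to the scientific name should leave one row per `tax_id`, which matters because duplicate keys would multiply rows in any later join. A small sanity-check sketch on hand-written records (the non-scientific entries are illustrative placeholders, not copied from `names.dmp`):

```julia
# Hand-written records; the "synonym" rows are placeholders for illustration
records = [
    (tax_id = 2, name = "Bacteria", name_class = "scientific name"),
    (tax_id = 2, name = "example synonym", name_class = "synonym"),
    (tax_id = 7, name = "Azorhizobium caulinodans", name_class = "scientific name"),
    (tax_id = 7, name = "example old name", name_class = "synonym"),
]

# Keep only the scientific-name rows
scientific = filter(r -> r.name_class == "scientific name", records)

# After filtering, every tax_id should appear exactly once
tax_ids = [r.tax_id for r in scientific]
is_unique = allunique(tax_ids)
println(length(scientific), " scientific names, unique ids: ", is_unique)
```

On the real table the equivalent check would be `allunique(names[:tax_id])`.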


Currently we have `division_id` as a field that will allow us to group by, approximately, taxonomic kingdom (the divisions don't map one-to-one onto kingdoms). However, we don't have any names associated with the divisions yet. We can import the division names from the `division.dmp` file by reading its `division_id` (1st) and `division_name` (3rd) fields, which will later let us join the tables on `division_id`.

In [6]:
columns = first(uCSV.read("../datasets/taxonomy/division.dmp", delim='|', trimwhitespace=true))[[1, 3]]
column_names = [:division_id, :division_name]
divisions = DataFrame(columns, column_names)

Row,division_id,division_name
,Int64,String
1,0,Bacteria
2,1,Invertebrates
3,2,Mammals
4,3,Phages
5,4,Plants and Fungi
6,5,Primates
7,6,Rodents
8,7,Synthetic and Chimeric
9,8,Unassigned
10,9,Viruses


We can now join the nodes table with the names table, and then join the result with the divisions table, to associate each taxonomy id and rank with a scientific name and a division name.

In [7]:
taxonomy_info = join(join(nodes, names; on = :tax_id, kind = :inner), divisions, on = :division_id, kind = :inner)

Row,tax_id,rank,division_id,name,name_class,division_name
,Int64,String,Int64,String,String,String
1,1,no rank,8,root,scientific name,Unassigned
2,2,superkingdom,0,Bacteria,scientific name,Bacteria
3,6,genus,0,Azorhizobium,scientific name,Bacteria
4,7,species,0,Azorhizobium caulinodans,scientific name,Bacteria
5,9,species,0,Buchnera aphidicola,scientific name,Bacteria
6,10,genus,0,Cellvibrio,scientific name,Bacteria
7,11,species,0,Cellulomonas gilvus,scientific name,Bacteria
8,13,genus,0,Dictyoglomus,scientific name,Bacteria
9,14,species,0,Dictyoglomus thermophilum,scientific name,Bacteria
10,16,genus,0,Methylophilus,scientific name,Bacteria
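
Conceptually, an inner join keeps only the rows whose key appears in both tables and glues the matching columns together. A minimal sketch using dictionaries as lookup tables, with a few rows copied from the outputs above plus one deliberately unmatched `tax_id` (the value `999999999` and the helper `inner_join_rows` are hypothetical):

```julia
# A few node rows; the last one has no entry in the name lookup
nodes_rows = [
    (tax_id = 2, rank = "superkingdom", division_id = 0),
    (tax_id = 7, rank = "species", division_id = 0),
    (tax_id = 999999999, rank = "species", division_id = 0),  # hypothetical id
]
name_by_id = Dict(2 => "Bacteria", 7 => "Azorhizobium caulinodans")
division_by_id = Dict(0 => "Bacteria", 8 => "Unassigned")

# Hypothetical helper: keep rows with a match in both lookups and merge the columns
function inner_join_rows(rows, name_by_id, division_by_id)
    joined = NamedTuple[]
    for r in rows
        haskey(name_by_id, r.tax_id) || continue
        haskey(division_by_id, r.division_id) || continue
        extra = (name = name_by_id[r.tax_id],
                 division_name = division_by_id[r.division_id])
        push!(joined, merge(r, extra))
    end
    return joined
end

joined = inner_join_rows(nodes_rows, name_by_id, division_by_id)
foreach(println, joined)
```

The unmatched row is silently dropped, which is the behavior to keep in mind when row counts shrink after an inner join.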


For the final join, we can merge our taxonomy information with the RefSeq assembly summary table to associate taxonomic division information with the reference genomes we have available.

In [8]:
genome_info = join(taxonomy_info, refseq_metadata, on = (:tax_id => :species_taxid), kind=:inner)

Row,tax_id,rank,division_id,name,name_class,division_name,# assembly_accession,bioproject,biosample,wgs_master,refseq_category,taxid,organism_name,infraspecific_name,isolate,version_status,assembly_level,release_type,genome_rep,seq_rel_date,asm_name,submitter,gbrs_paired_asm,paired_asm_comp,ftp_path,excluded_from_refseq,relation_to_type_material
,Int64,String,Int64,String,String,String,String,String,String,String,String,Int64,String,String,String,String,String,String,String,String,String,String,String,String,String,String,String
1,9,species,0,Buchnera aphidicola,scientific name,Bacteria,GCF_000009605.1,PRJNA57805,SAMD00061095,,reference genome,107806,Buchnera aphidicola str. APS (Acyrthosiphon pisum),strain=APS,Tokyo1998,latest,Complete Genome,Major,Full,2004/05/11,ASM960v1,Rikken GSC,GCA_000009605.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/605/GCF_000009605.1_ASM960v1,,
2,139,species,0,Borreliella burgdorferi,scientific name,Bacteria,GCF_000008685.2,PRJNA57581,SAMN02603966,,reference genome,224326,Borreliella burgdorferi B31,strain=B31,,latest,Complete Genome,Major,Full,2011/11/09,ASM868v2,TIGR,GCA_000008685.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2,,assembly from type material
3,158,species,0,Treponema denticola,scientific name,Bacteria,GCF_000008185.1,PRJNA57583,SAMN02603967,,reference genome,243275,Treponema denticola ATCC 35405,strain=ATCC 35405,,latest,Complete Genome,Major,Full,2004/02/03,ASM818v1,TIGR,GCA_000008185.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/185/GCF_000008185.1_ASM818v1,,assembly from type material
4,173,species,0,Leptospira interrogans,scientific name,Bacteria,GCF_000092565.1,PRJNA57881,SAMN02603127,,reference genome,189518,Leptospira interrogans serovar Lai str. 56601,strain=56601,,latest,Complete Genome,Major,Full,2010/04/06,ASM9256v1,"Chinese National HGC, Shanghai",GCA_000092565.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/092/565/GCF_000092565.1_ASM9256v1,,
5,197,species,0,Campylobacter jejuni,scientific name,Bacteria,GCF_000009085.1,PRJNA57587,SAMEA1705929,,reference genome,192222,Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819,strain=NCTC 11168,,latest,Complete Genome,Major,Full,2003/05/06,ASM908v1,Sanger Institute,GCA_000009085.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/085/GCF_000009085.1_ASM908v1,,
6,210,species,0,Helicobacter pylori,scientific name,Bacteria,GCF_000008525.1,PRJNA57787,SAMN02603995,,reference genome,85962,Helicobacter pylori 26695,strain=26695,,latest,Complete Genome,Major,Full,1999/12/22,ASM852v1,TIGR,GCA_000008525.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/525/GCF_000008525.1_ASM852v1,,
7,263,species,0,Francisella tularensis,scientific name,Bacteria,GCF_000008985.1,PRJNA57589,SAMEA3138185,,reference genome,177416,Francisella tularensis subsp. tularensis SCHU S4,strain=SCHU S4,,latest,Complete Genome,Major,Full,2009/06/18,ASM898v1,Swedish Defence Research Agency,GCA_000008985.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/985/GCF_000008985.1_ASM898v1,,
8,274,species,0,Thermus thermophilus,scientific name,Bacteria,GCF_000091545.1,PRJNA58223,SAMD00061070,,reference genome,300852,Thermus thermophilus HB8,strain=HB8,,latest,Complete Genome,Major,Full,2004/11/15,ASM9154v1,"National Institute of Advanced Industrial Science and Technology, Japan",GCA_000091545.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/091/545/GCF_000091545.1_ASM9154v1,,assembly from type material
9,287,species,0,Pseudomonas aeruginosa,scientific name,Bacteria,GCF_000006765.1,PRJNA57945,SAMN02603714,,reference genome,208964,Pseudomonas aeruginosa PAO1,strain=PAO1,,latest,Complete Genome,Major,Full,2006/07/07,ASM676v1,PathoGenesis Corporation,GCA_000006765.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1,,
10,303,species,0,Pseudomonas putida,scientific name,Bacteria,GCF_000007565.2,PRJNA57843,SAMN02603999,,reference genome,160488,Pseudomonas putida KT2440,strain=KT2440,,latest,Complete Genome,Major,Full,2016/02/26,ASM756v2,TIGR,GCA_000007565.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/565/GCF_000007565.2_ASM756v2,,


Visually, we can see that the `name` field from the taxonomy information contains only the species name, while the `organism_name` field from the RefSeq table contains the species name plus strain-level information (if available). Therefore, the quickest way to check whether all of the names match is to test whether each `organism_name` value has the corresponding `name` value as a prefix.

In [9]:
startswith.(genome_info[:organism_name], genome_info[:name])

173-element BitArray{1}:
  true
  true
  true
  true
  true
  true
  true
  true
  true
  true
  true
  true
  true
     ⋮
  true
  true
  true
  true
 false
 false
  true
 false
 false
 false
 false
 false

Not every organism name is prefixed with the species name, but it looks like most are. To see how many names follow the expected pattern, we can count the number of `true` values from the previous query.

In [10]:
count(startswith.(genome_info[:organism_name], genome_info[:name]))

159

That's 159 of the 173 rows in the table, so it should be easy to manually review the rows that don't follow the expected pattern. We can do this by finding the indices of those rows and then subsetting the table to them.

In [11]:
indices = findall(.!startswith.(genome_info[:organism_name], genome_info[:name]))
genome_info[indices, [:name, :organism_name]]

Row,name,organism_name
,String,String
1,Bacillus thuringiensis,[Bacillus thuringiensis] serovar konkukian str. 97-27
2,Hepacivirus C,Hepatitis C virus subtype 1a
3,Hepacivirus C,Hepatitis C virus QC69
4,Marburg marburgvirus,"Marburg virus - Musoke, Kenya, 1980"
5,Norwalk virus,Norovirus GI
6,Norwalk virus,Norovirus GV
7,Pseudomonas syringae group genomosp. 3,Pseudomonas syringae pv. tomato str. DC3000
8,Rodent pegivirus,Pegivirus J
9,Middle East respiratory syndrome-related coronavirus,Human betacoronavirus 2c EMC/2012
10,Panicum papanivirus 1,Panicum mosaic satellite virus
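
Some mismatches are purely typographic, such as the square brackets around `[Bacillus thuringiensis]`. One possible refinement, sketched here on three name pairs taken from the outputs above, is to strip brackets before comparing (the helper `strip_brackets` is hypothetical):

```julia
# Three (name, organism_name) pairs copied from the tables above
name = ["Pseudomonas aeruginosa", "Bacillus thuringiensis", "Hepacivirus C"]
organism_name = [
    "Pseudomonas aeruginosa PAO1",
    "[Bacillus thuringiensis] serovar konkukian str. 97-27",
    "Hepatitis C virus subtype 1a",
]

# Hypothetical helper: remove square brackets before the prefix test
strip_brackets(s) = replace(s, r"[\[\]]" => "")

strict = startswith.(organism_name, name)
relaxed = startswith.(strip_brackets.(organism_name), name)
println(count(strict), " strict matches, ", count(relaxed), " relaxed matches")
```

The bracketed name now matches, while a genuine synonym such as `Hepacivirus C` still needs manual review.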


These naming inconsistencies appear minor: the mismatched names all look like synonyms, so we can continue with what we have.

The next question is how many organisms from each taxonomic division are represented within the RefSeq reference genomes. The easiest way to answer it is with the `countmap` function from the [StatsBase](https://github.com/JuliaStats/StatsBase.jl) library. The result shows that the reference genome set is primarily composed of bacterial and viral genomes, with a few phages and just one or two genomes from each remaining group.

In [12]:
using StatsBase
countmap(genome_info[:division_name])

Dict{String,Int64} with 8 entries:
  "Bacteria"         => 120
  "Rodents"          => 1
  "Vertebrates"      => 1
  "Primates"         => 1
  "Invertebrates"    => 2
  "Phages"           => 5
  "Plants and Fungi" => 2
  "Viruses"          => 41
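
For readers who want to avoid the extra dependency, the tally that `countmap` produces can be sketched with a plain dictionary. The helper `count_values` and the toy input are hypothetical:

```julia
# Hypothetical helper: count how often each value occurs
function count_values(values)
    counts = Dict{eltype(values),Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

# Toy input, not the real division column
divisions = ["Bacteria", "Viruses", "Bacteria", "Phages", "Bacteria", "Viruses"]
counts = count_values(divisions)

# Sort the (value => count) pairs from most to least frequent
sorted = sort(collect(counts); by = last, rev = true)
foreach(println, sorted)
```

Sorting the pairs makes the dominant groups easier to spot than the unordered dictionary display.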

Finally, let's randomly sample a representative genome from each category. I'll implement a special case for `Invertebrates` and `Plants and Fungi`, where I would prefer to evaluate both organisms within each group (Invertebrates: C. elegans & D. melanogaster; Plants and Fungi: A. thaliana & S. cerevisiae).

In [13]:
using Random
Random.seed!(3)
for division in unique(genome_info[:division_name])
    if division == "Plants and Fungi" || division == "Invertebrates"
        # Keep every genome in these two small divisions
        subset = genome_info[findall(genome_info[:division_name] .== division), :]
        for row in eachrow(subset)
            println(join([division, row[:name], basename(row[:ftp_path]) * "_genomic.fna.gz"], "\t\t"))
        end
    else
        # Otherwise pick one genome at random from the division
        subset = genome_info[rand(findall(genome_info[:division_name] .== division)), :]
        println(join([division, subset[:name], basename(subset[:ftp_path]) * "_genomic.fna.gz"], " \t\t"))
    end
end

Bacteria 		Flavobacterium psychrophilum 		GCF_000064305.2_ASM6430v2_genomic.fna.gz
Plants and Fungi		Arabidopsis thaliana		GCF_000001735.4_TAIR10.1_genomic.fna.gz
Plants and Fungi		Saccharomyces cerevisiae		GCF_000146045.2_R64_genomic.fna.gz
Invertebrates		Caenorhabditis elegans		GCF_000002985.6_WBcel235_genomic.fna.gz
Invertebrates		Drosophila melanogaster		GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz
Vertebrates 		Danio rerio 		GCF_000002035.6_GRCz11_genomic.fna.gz
Primates 		Homo sapiens 		GCF_000001405.39_GRCh38.p13_genomic.fna.gz
Rodents 		Mus musculus 		GCF_000001635.26_GRCm38.p6_genomic.fna.gz
Phages 		Chlamydia virus Chp2 		GCF_000849665.1_ViralProj14593_genomic.fna.gz
Viruses 		Norwalk virus 		GCF_000868425.1_ViralProj17577_genomic.fna.gz
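
The same per-group sampling can be sketched on plain vectors with an explicit random number generator, so the draw does not depend on the global seed. This is one possible arrangement rather than the code used above; `sample_per_group` and the toy labels are hypothetical:

```julia
using Random

# Hypothetical helper: pick one random index per group, or every index
# for the groups listed in `keep_all`
function sample_per_group(groups, keep_all; rng = Random.default_rng())
    # Map each group label to the row indices where it occurs
    by_group = Dict{eltype(groups),Vector{Int}}()
    for (i, g) in enumerate(groups)
        push!(get!(by_group, g, Int[]), i)
    end
    selected = Int[]
    for g in unique(groups)
        idx = by_group[g]
        if g in keep_all
            append!(selected, idx)          # keep the whole group
        else
            push!(selected, rand(rng, idx)) # keep a single random member
        end
    end
    return selected
end

# Toy labels, not the real division column
groups = ["Bacteria", "Bacteria", "Invertebrates", "Invertebrates",
          "Viruses", "Viruses", "Viruses"]
picked = sample_per_group(groups, ["Invertebrates"]; rng = MersenneTwister(3))
println(picked)
```

Passing the generator as an argument keeps the sampling reproducible even if other code calls `rand` in between.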


That's it for this post, but these genomes will be utilized in future posts!