# GTDB processing

## Background

[GTDB](http://gtdb.ecogenomic.org/) (Genome Taxonomy Database) is a large-scale effort of curating microbial taxonomy based on phylogeny. The dataset includes NCBI RefSeq (GCF) and GenBank (GCA) genomes, as well as genomes they assembled from publicly available metagenomes (UBA). The current release - r86.1 - contains 125,243 bacterial genomes and 2,075 archaeal genomes, of which 27,372 bacterial genomes and 1,569 archaeal genomes are represented in two phylogenetic trees, respectively. The data are available at the [website](http://gtdb.ecogenomic.org/downloads).

## Preparation

### Dependencies

In [1]:
import pandas as pd
from skbio import TreeNode

### Input files

GTDB r86.1 release

In [2]:
bac_metadata_fp = 'bac_metadata_r86.tsv.xz'
bac_taxonomy_fp = 'bac_taxonomy_r86.tsv.xz'
bac_tree_fp = 'bac120_r86.1.tree'

In [3]:
arc_metadata_fp = 'arc_metadata_r86.tsv.xz'
arc_taxonomy_fp = 'arc_taxonomy_r86.tsv.xz'
arc_tree_fp = 'arc122_r86.1.tree'

WoL data

In [4]:
wgs_fp = '../../../genomes/glists/fna_prok.txt'
wgis_fp = '../../../genomes/glists/in.txt'
wtree_fp = '../../../trees/astral/newick/astral.nid.nwk'

### Examination of GTDB files

Three files are relevant to this study (exemplified here are `bac`, while there are also `arc`):

Metadata: `bac_metadata_r86.tsv`

In [5]:
!xzcat $bac_metadata_fp | cut -f1-5 | head -n5

accession	scaffold_count	gc_count	longest_scaffold	gc_percentage
RS_GCF_001245025.1	157	2457823	329958	52.2585149986
RS_GCF_000678935.1	4	2880443	3928455	65.5995847821
RS_GCF_000020485.1	1	976684	2578146	37.8831920302
RS_GCF_001206855.1	75	837175	119381	39.5136727904


Taxonomy: `bac_taxonomy_r86.tsv`

In [6]:
!xzcat $bac_taxonomy_fp | cut -c1-100 | head -n5

RS_GCF_001300075.1	d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__St
RS_GCF_001245025.1	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enter
RS_GCF_000678935.1	d__Bacteria;p__Actinobacteriota;c__Actinobacteria;o__Corynebacteriales;f__Coryneb
RS_GCF_000020485.1	d__Bacteria;p__Firmicutes_F;c__Halanaerobiia;o__Halanaerobiales;f__Halothermotric
RS_GCF_001206855.1	d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__St


Tree: `bac120_r86.1.tree`

In [7]:
!cat $bac_tree_fp | cut -c1-300

(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((RS_GCF_900110195.1:1e-05,RS_GCF_900187485.1:1e-05)100.0:1e-05,RS_GCF_000783395.1:2e-05)64.0:1e-05,RS_GCF_000802985.1:2e-05)92.0:1e-05,(RS_GCF_002003425.1:1e-05,RS_GCF_900187515.1:2e-05)60.0:1e-05)39.0:1e-05,RS_GCF_000276585.1:2e-


## Processing

### GTDB identifiers

Read GTDB genome IDs.

In [8]:
arc_ = !xzcat $arc_metadata_fp | cut -f1 | tail -n+2
bac_ = !xzcat $bac_metadata_fp | cut -f1 | tail -n+2
ggs = arc_ + bac_
len(ggs)

127318

Extract UBA genomes

In [9]:
uba = sorted([x for x in ggs if x.startswith('UBA')], key=lambda x: int(x[3:]))
len(uba)

3130

In [10]:
print(', '.join(uba[:5]))

UBA7904, UBA7905, UBA7906, UBA7907, UBA7908


Generate a GTDB to WoL genome ID map. e.g.: `GB_GCA_000006155.2` => `GC000006155`.

In [11]:
g2w = {x: 'G%s' % x.split('_')[-1].split('.')[0]
       for x in ggs if x.startswith('GB_GCA_')
       or x.startswith('RS_GCF_')}
len(g2w)

124188

Check for duplicate genomes in GTDB (same assembly ID represented by both RefSeq and GenBank versions).

In [12]:
gdups = set()
used = set()
for g, w in g2w.items():
    if w in used:
        gdups.add(w)
    else:
        used.add(w)
gdups

{'G000009205', 'G000821245', 'G001645235'}

Read WoL genome IDs.

In [13]:
wgs = !cat ../../../genomes/glists/fna_prok.txt
len(wgs)

86200

Identify shared genome IDs.

In [14]:
sgs = set(wgs).intersection(set(g2w.values()))
len(sgs)

83466

Read genome IDs in WoL tree.

In [15]:
wigs = !cat ../../../genomes/glists/in.txt
len(wigs)

10575

Identify shared genome IDs in the tree.

In [16]:
sigs = sgs.intersection(wigs)
len(sigs)

9732

Export WoL-to-GTDB translation table. If there are duplicates (GCF (RefSeq) + GCA (GenBank)), select the GCF.

In [17]:
w2g = {}
with open('glists/wol2gtdb.txt', 'w') as f:
    for g, w in sorted(g2w.items(), key=lambda x: x[1]):
        if w in gdups:
            print('%s: %s' % (w, g))
            if 'GCF' not in g:
                continue
        w2g[w] = g
        f.write('%s\t%s\n' % (w, g))

G000009205: RS_GCF_000009205.1
G000009205: GB_GCA_000009205.2
G000821245: RS_GCF_000821245.1
G000821245: GB_GCA_000821245.2
G001645235: GB_GCA_001645235.2
G001645235: RS_GCF_001645235.1


### Taxonomy

In [18]:
ranks = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']

Load GTDB taxonomy

In [19]:
taxfile = !xzcat $arc_taxonomy_fp $bac_taxonomy_fp

In [20]:
data = {}
for line in taxfile:
    g, lineage = line.split('\t')
    data[g] = {}
    taxa = lineage.split(';')
    if len(taxa) != len(ranks):
        raise ValueError(taxa)
    for i, taxon in enumerate(taxa):
        if len(taxon) < 3:
            raise ValueError(taxon)
        if taxon[1:3] != '__':
            raise ValueError(taxon)
        if taxon[0] != ranks[i][0]:
            raise ValueError(taxon)
        data[g][ranks[i]] = taxon[3:]
len(data)

127318

In [21]:
gtax = pd.DataFrame().from_dict(data, orient='index')
gtax.index.name = 'genome'
gtax.head()

Unnamed: 0_level_0,domain,phylum,class,order,family,genus,species
genome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
GB_GCA_000006155.2,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae_G,Bacillus_A,Bacillus_A anthracis
GB_GCA_000007185.1,Archaea,Euryarchaeota,Methanopyri,Methanopyrales,Methanopyraceae,Methanopyrus,Methanopyrus kandleri
GB_GCA_000007225.1,Archaea,Crenarchaeota,Thermoprotei,Thermoproteales,Thermoproteaceae,Pyrobaculum,Pyrobaculum aerophilum
GB_GCA_000007385.1,Bacteria,Proteobacteria,Gammaproteobacteria,Xanthomonadales,Xanthomonadaceae,Xanthomonas,Xanthomonas oryzae
GB_GCA_000007405.1,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli


In [22]:
gtax.to_csv('ranks/gtdb.tsv', sep='\t')

Note: The GTDB taxonomy is filled in all cells (i.e., there is no unclassified ranks).

In [23]:
for rank in gtax.columns:
    print('%s: %d' % (rank, gtax[rank].nunique()))

domain: 2
phylum: 125
class: 324
order: 873
family: 2025
genus: 6918
species: 11551


Downsample to genomes with WoL-style IDs.

In [24]:
gwtax = gtax.loc[sorted(g2w)]
gwtax['gid'] = gwtax.index.to_series().map(g2w)
gwtax.shape[0]

124188

Make sure duplicated genome IDs have identical taxonomy annotations.

In [25]:
vc = gwtax['gid'].value_counts()
dups = vc[vc > 1].index.tolist()
dups

['G001645235', 'G000009205', 'G000821245']

In [26]:
for dup in dups:
    rows = gwtax[gwtax.gid == dup]
    for rank in ranks:
        if len(set(rows[rank].tolist())) > 1:
            raise ValueError('%s has conflicting taxonomy annotations.' % dup)

Subset and translate the genome IDs.

In [27]:
gwtax = gwtax.sort_values('gid').drop_duplicates('gid', keep='first').set_index('gid')
gwtax.shape[0]

124185

In [28]:
gwtax.head()

Unnamed: 0_level_0,domain,phylum,class,order,family,genus,species
gid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
G000003135,Bacteria,Actinobacteriota,Actinobacteria,Actinomycetales,Bifidobacteriaceae,Bifidobacterium,Bifidobacterium longum
G000003215,Bacteria,Firmicutes_A,Clostridia,Peptostreptococcales,Peptostreptococcaceae,Clostridioides,Clostridioides difficile
G000003645,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae_G,Bacillus_A,Bacillus_A cereus_AE
G000003925,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae_G,Bacillus_A,Bacillus_A mycoides
G000003955,Bacteria,Firmicutes,Bacilli,Bacillales,Bacillaceae_G,Bacillus_A,Bacillus_A cereus_P


In [29]:
for rank in gwtax.columns:
    print('%s: %d' % (rank, gwtax[rank].nunique()))

domain: 2
phylum: 125
class: 317
order: 848
family: 1943
genus: 6655
species: 11463


In [30]:
gwtax.to_csv('ranks/gtdb.wid.tsv', sep='\t')

Downsample to genomes shared between GTDB and WoL.

In [31]:
gwtax = gwtax.loc[sorted(sgs)]
for rank in gwtax.columns:
    print('%s: %d' % (rank, gwtax[rank].nunique()))

domain: 2
phylum: 103
class: 241
order: 600
family: 1297
genus: 4380
species: 9615


In [32]:
gwtax.to_csv('ranks/gtdb.wid.prok.tsv', sep='\t')

Downsample to genomes in the WoL tree.

In [33]:
gwtax = gwtax.loc[sorted(sigs)]
for rank in gwtax.columns:
    print('%s: %d' % (rank, gwtax[rank].nunique()))

domain: 2
phylum: 94
class: 215
order: 523
family: 1111
genus: 3619
species: 5503


In [34]:
gwtax.to_csv('ranks/gtdb.wid.in.tsv', sep='\t')

List additional phyla in GTDB but not in the WoL tree.

In [35]:
set(gtax['phylum'].unique()) - set(gwtax['phylum'].unique())

{'4572-55',
 'AABM5-125-24',
 'BRC1',
 'CG03',
 'CG2-30-53-67',
 'CG2-30-70-394',
 'Desantisbacteria',
 'Entotheonellota',
 'Eremiobacterota',
 'Fermentibacterota',
 'Firestonebacteria',
 'Goldbacteria-1',
 'OLB16',
 'RBG-13-61-14',
 'UAP2',
 'UBA2233',
 'UBA3054',
 'UBA5359',
 'UBA6262',
 'UBA8481',
 'UBP1',
 'UBP10',
 'UBP12',
 'UBP13',
 'UBP14',
 'UBP17',
 'UBP18',
 'UBP3',
 'UBP4',
 'UBP6',
 'UBP7'}

### Phylogeny

#### Read GTDB trees

In [36]:
wtree = TreeNode.read(wtree_fp)
wtree.count(tips=True)

10575

In [37]:
gatree = TreeNode.read(arc_tree_fp, convert_underscores=False)
gatree.count(tips=True)

1569

In [38]:
gbtree = TreeNode.read(bac_tree_fp, convert_underscores=False)
gbtree.count(tips=True)

27372

Export trees without internal node labels but only support values

In [39]:
gatree_ = gatree.copy()
for node in gatree_.non_tips(include_self=True):
    if node.name:
        node.name = node.name.split(':')[0].split('.')[0]
gatree_.write('trees/arc122_r86.1.bs_only.nwk')

'trees/arc122_r86.1.bs_only.nwk'

In [40]:
gbtree_ = gbtree.copy()
for node in gbtree_.non_tips(include_self=True):
    if node.name:
        node.name = node.name.split(':')[0].split('.')[0]
gbtree_.write('trees/bac120_r86.1.bs_only.nwk')

'trees/bac120_r86.1.bs_only.nwk'

#### Overlap taxon sets

Get genome IDs in GTDB trees.

In [41]:
gatgs = gatree.subset()
gbtgs = gbtree.subset()

In [42]:
gtgs = sorted(list(gatgs) + list(gbtgs))
len(gtgs)

28941

In [43]:
print(', '.join(gtgs[:5]))

GB_GCA_000007185.1, GB_GCA_000007225.1, GB_GCA_000008085.1, GB_GCA_000010565.1, GB_GCA_000011005.1


Export GTDB tree taxon list.

In [44]:
with open('glists/gid_in_gtrees.txt', 'w') as f:
    for g in gtgs:
        f.write('%s\n' % g)

Get WoL-style genome IDs of GTDB tree taxa.

In [45]:
gtwgs = sorted([g2w[x] for x in gtgs if x in g2w])
len(gtwgs)

27222

Get GTDB tree taxa that are in the WoL genome pool.

In [46]:
sgtwgs = sgs.intersection(gtwgs)
len(sgtwgs)

17514

In [47]:
with open('glists/wid_in_gtrees.prok.txt', 'w') as f:
    for g in sorted(sgtwgs):
        f.write('%s\n' % g)

Get GTDB tree taxa that are in the WoL tree.

In [48]:
sgtwigs = sorted(set(wigs).intersection(gtwgs))
len(sgtwigs)

8042

In [49]:
with open('glists/wid_in_gtrees.in.txt', 'w') as f:
    for g in sorted(sgtwigs):
        f.write('%s\n' % g)

Shrink the two trees to contain common taxa only.

In [50]:
gatgs_to_keep = [x for x in gatgs if x in g2w and g2w[x] in sgtwigs]
gatree_ = gatree_.shear(gatgs_to_keep)
gatree_.count(tips=True)

486

In [51]:
gatree_.write('trees/arc122_r86.1.lap.nwk')

'trees/arc122_r86.1.lap.nwk'

In [52]:
for tip in gatree_.tips():
    tip.name = g2w[tip.name]
gatree_.write('trees/arc122_r86.1.lap.wid.nwk')

'trees/arc122_r86.1.lap.wid.nwk'

In [53]:
gbtgs_to_keep = [x for x in gbtgs if x in g2w and g2w[x] in sgtwigs]
gbtree_ = gbtree_.shear(gbtgs_to_keep)
gbtree_.count(tips=True)

7556

In [54]:
gbtree_.write('trees/bac120_r86.1.lap.nwk')

'trees/bac120_r86.1.lap.nwk'

In [55]:
for tip in gbtree_.tips():
    tip.name = g2w[tip.name]
gbtree_.write('trees/bac120_r86.1.lap.wid.nwk')

'trees/bac120_r86.1.lap.wid.nwk'

Arbitrarily connect two trees at roots

In [56]:
gtree_ = TreeNode(children=[gatree_, gbtree_])
gtree_.count(tips=True)

8042

In [57]:
gtree_.write('trees/joint.lap.wid.nwk')

'trees/joint.lap.wid.nwk'

In [58]:
wtree_ = wtree.shear(sgtwigs)
wtree_.count(tips=True)

8042

In [59]:
wtree_.write('trees/astral.lap.nwk')

'trees/astral.lap.nwk'

In [60]:
t_ = wtree_.copy()
for tip in t_.tips():
    tip.name = w2g[tip.name]
t_.write('trees/astral.lap.gid.nwk')

'trees/astral.lap.gid.nwk'