# Greengenes

## Background

Goal: generate a phylogenetic tree based on the 99% Greengenes tree.

## Preparation

### Dependencies

In [1]:
import gzip, bz2
from skbio import TreeNode

### Raw data

Suggested by Daniel.


OTU ID to taxonomic lineage map: [`gg_13_5_taxonomy.txt`](ftp://ftp.microbio.me/greengenes_release/gg_13_5/gg_13_5_taxonomy.txt.gz)

In [32]:
!bzcat < raw/gg_13_5_taxonomy.txt.bz2 | less | head -n5

228054	k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__
844608	k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__
178780	k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__
198479	k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__
187280	k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__


OTU ID to GenBank accession map: [`gg_13_5_accessions.txt.gz`](ftp://ftp.microbio.me/greengenes_release/gg_13_5/gg_13_5_accessions.txt.gz)

In [35]:
!bzcat < raw/gg_13_5_accessions.txt.bz2 | less | head -n5

#gg_id	accession_type	accession
4	Genbank	AB019749.1
7	Genbank	AB019734.1
13	Genbank	AF068817.2
14	Genbank	AF068820.2


99% OTU clustering scheme: [`99_otu_map.txt`](ftp://ftp.microbio.me/greengenes_release/gg_13_8_otus/otus/99_otu_map.txt)

In [36]:
!bzcat raw/99_otu_map.txt.bz2 | less | head -n5

42435	228054	844608	178780	198479	187280	179180	175058	176884
83889	228057	234102	234685	1121497	767731	230047	330751	317400	347564	352714	234168	231859	232604	233538	573838	136068	585338	4474077	1121583	4342576	4382430	4486293
192805	73627	785154	581446	177511	245190
154341	378462	398771	445143	394166	406264	391797	374752	497126
30887	89370	582313	300272	264371


99% OTU tree: [`gg_13_5_otus_99_annotated.tree`](ftp://ftp.microbio.me/greengenes_release/gg_13_5/gg_13_5_otus_99_annotated.tree.gz)

In [37]:
!cat raw/gg_13_5_otus_99_annotated.tree | less | head -n7

[
Wed Apr 24 15:04:56 2013: Loaded from /Users/philhugenholtz/Documents/greengenes/2013_2/gg_13_4_rc1/gg_13_4_rc1_23april2013.tree
Sun May 12 20:10:38 2013: tree_gg_13_4_rc1_23april2013 saved to /Users/philhugenholtz/Documents/greengenes/2013_4/gg_13_4_rc1_23april2013.tree
]
(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((1018666:0.01057,
421164:0.00514
):0.00102,


## Analyses

### Read Greengenes database

Generate OTU ID to NCBI accession (without version) map.

In [8]:
otu2accn = {}
with bz2.open('raw/gg_13_5_accessions.txt.bz2', 'rt') as f:
    # columns: gg_id, accession_type, accession
    for line in f:
        x = line.rstrip('\r\n').split('\t')
        if x[1] == 'Genbank':
            otu2accn[x[0]] = x[2].split('.')[0]
print('Number of Greengenes OTU IDs: %d.' % len(otu2accn))

Number of Greengenes OTU IDs: 1262440.
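
The version-stripping step can be illustrated on a few rows in the same three-column layout as the accession table. This is a minimal sketch: `strip_version` and `otu2accn_demo` are hypothetical names introduced here for illustration, and the rows are copied from the preview above.

```python
def strip_version(accession):
    """Return the accession without its trailing '.<version>' suffix."""
    return accession.split('.')[0]


# Rows in the same tab-separated layout as gg_13_5_accessions.txt
rows = [
    '#gg_id\taccession_type\taccession',
    '4\tGenbank\tAB019749.1',
    '13\tGenbank\tAF068817.2',
]

otu2accn_demo = {}
for row in rows:
    gg_id, accn_type, accn = row.split('\t')
    # The type check also skips the header row, whose second field
    # is 'accession_type' rather than 'Genbank'
    if accn_type == 'Genbank':
        otu2accn_demo[gg_id] = strip_version(accn)

print(otu2accn_demo)
```

Because versions are dropped, two OTU IDs that point to different versions of one record would end up with the same accession string.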


Extract unique accessions.

In [9]:
accns = set()
for accn in otu2accn.values():
    accns.add(accn)
print('Number of unique accessions: %d.' % len(accns))

Number of unique accessions: 1262130.


Read master OTU ID to co-clustered OTU IDs map.

In [10]:
otu_map = {}
with bz2.open('raw/99_otu_map.txt.bz2', 'rt') as f:
    for line in f:
        x = line.rstrip('\r\n').split('\t')
        otu_map[x[0]] = set(x[1:])
print('Number of 99%% OTUs: %d.' % len(otu_map))

Number of 99% OTUs: 203452.
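
As a small illustration of the parsing above, the sketch below reads two shortened lines in the same tab-separated layout from an in-memory buffer. The reverse lookup `member2key` is a hypothetical addition that the notebook itself does not build; it is shown only because such a lookup is often convenient when tracing a co-clustered ID back to its line.

```python
import io

# Two shortened lines in the same layout as 99_otu_map.txt
text = '192805\t73627\t785154\n30887\t89370\t582313\t300272\n'

otu_map_demo = {}
for line in io.StringIO(text):
    fields = line.rstrip('\r\n').split('\t')
    # First field is the key, the remaining fields are the co-clustered IDs
    otu_map_demo[fields[0]] = set(fields[1:])

# Hypothetical reverse lookup: co-clustered ID -> key of the line it appears on
member2key = {
    member: key
    for key, members in otu_map_demo.items()
    for member in members
}

print(len(otu_map_demo), len(member2key))
```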


Read Greengenes reference tree.

In [7]:
ggtree = TreeNode.read('raw/gg_13_5_otus_99_annotated.tree')
print('Number of taxa in 99%% Greengenes tree: %d' % ggtree.count(tips=True))

Number of taxa in 99% Greengenes tree: 203452


In [11]:
str(ggtree)[:400]

'(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((1018666:0.01057,421164:0.00514):0.00102,989926:0.0026):0.00014,892241:0.00348):0.00015,((1046178:0.00523,854915:0.00348):0.00046,1039981:0.0061):0.00043):0.00014,((((1087110:0.00962,958846:0.01053):0.00255,1020662:0.0087):0.00014,938027:0.00434):0.00014,(880205:0.01053,901170:0.007):0.0'

### Match OTU IDs with WoL genomes

Read nucleotide accession to TaxID map.
 - Based on the proper versions (timestamp: Mar. 7, 2017) of [`nucl_gb.accession2taxid.gz`](ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz) provided by NCBI.

In [52]:
%%script false
accn2tid = {}
with gzip.open('nucl_gb.accession2taxid.gz', 'rt') as f:
    # columns: accession, accession.version, taxid, gi
    for line in f:
        x = line.rstrip('\r\n').split('\t')
        if x[0] in accns:
            accn2tid[x[0]] = x[2]
print('Number of accessions with TaxID: %d.' % len(accn2tid))
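
The cell above streams the compressed table line by line and keeps only the accessions of interest, so the full table is never held in memory. The same pattern can be tried on a tiny gzip-compressed table built in memory. This is a sketch under stated assumptions: the TaxIDs and GI numbers below are made up, and `wanted` stands in for `accns`.

```python
import gzip
import io

# Build a tiny gzip-compressed table in memory (TaxIDs and GIs are made up)
table = (
    'accession\taccession.version\ttaxid\tgi\n'
    'AB019749\tAB019749.1\t111\t1\n'
    'ZZ000000\tZZ000000.1\t222\t2\n'
    'AF068817\tAF068817.2\t333\t3\n'
)
buf = io.BytesIO(gzip.compress(table.encode()))

wanted = {'AB019749', 'AF068817'}  # stands in for `accns`
accn2tid_demo = {}
with gzip.open(buf, 'rt') as f:
    for line in f:
        fields = line.rstrip('\r\n').split('\t')
        # Set membership is a constant-time check, and the header row
        # is skipped because 'accession' is not a wanted accession
        if fields[0] in wanted:
            accn2tid_demo[fields[0]] = fields[2]

print(accn2tid_demo)
```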

`accn2tid.txt` is a shrunken local version of this map.

In [12]:
with open('accn2tid.txt', 'r') as f:
    accn2tid = dict(x.split('\t') for x in f.read().splitlines())
print('Number of accessions with TaxID: %d.' % len(accn2tid))

Number of accessions with TaxID: 1155110.


Read genome ID to TaxID map.

In [35]:
g2tid = {}
with bz2.open('../../genomes/summary.txt.bz2', 'rt') as f:
    next(f)
    for line in f:
        x = line.rstrip('\r\n').split('\t')
        g2tid[x[0]] = x[21]

Optional: include taxa in the current ToL only.

In [36]:
with open('../../ToLs/taxa.txt', 'r') as f:
    gs = set(f.read().splitlines())
g2tid = {g: g2tid[g] for g in gs}

In [37]:
tid2gs = {}
for g, tid in g2tid.items():
    if tid not in tid2gs:
        tid2gs[tid] = set([g])
    else:
        tid2gs[tid].add(g)
print('Number of unique TaxIDs assigned to the %d genomes: %d.' % (len(g2tid), len(tid2gs)))

Number of unique TaxIDs assigned to the 10575 genomes: 9887.
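
The inversion from genome-to-TaxID into TaxID-to-genomes can also be written with `collections.defaultdict`, which removes the explicit membership branch. This is an equivalent sketch rather than the notebook's own code; the genome IDs and TaxIDs below are hypothetical.

```python
from collections import defaultdict

# Hypothetical genome ID -> TaxID assignments
g2tid_demo = {'G1': '562', 'G2': '562', 'G3': '1280'}

tid2gs_demo = defaultdict(set)
for g, tid in g2tid_demo.items():
    tid2gs_demo[tid].add(g)

print('%d genomes, %d unique TaxIDs' % (len(g2tid_demo), len(tid2gs_demo)))
```

One caveat with this variant: indexing a `defaultdict` with a missing key creates that key, so later membership tests should use `tid in tid2gs_demo` rather than `tid2gs_demo[tid]`.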


### Shrink Greengenes tree to common taxa

Shrink the tree to only taxa assigned to TaxIDs represented by WoL genomes.

In [51]:
tree = ggtree.copy()
otu2tid = {}
otus_to_keep = []
for tip in tree.tips():
    otu = tip.name
    accn = otu2accn[otu]
    if accn in accn2tid:
        tid = accn2tid[accn]
        if tid in tid2gs:
            otu2tid[otu] = tid
            otus_to_keep.append(otu)
tree = tree.shear(otus_to_keep)
print('Number of tips assigned to TaxIDs represented by the genomes: %d' % tree.count(tips=True))

Number of tips assigned to TaxIDs represented by the genomes: 6871
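
The filter above is a chain of three lookups: OTU ID to accession, accession to TaxID, and TaxID to genomes. The sketch below isolates that chain in a hypothetical helper, `otu_to_taxid`, using made-up IDs. Unlike the cell above, it uses `dict.get` for the first lookup, so an OTU without an accession is skipped instead of raising `KeyError`; that difference is an assumption of this sketch, not the notebook's behavior.

```python
def otu_to_taxid(otu, otu2accn, accn2tid, tid2gs):
    """Return the TaxID of an OTU if that TaxID has genomes, else None."""
    accn = otu2accn.get(otu)   # None if the OTU has no accession
    tid = accn2tid.get(accn)   # None if the accession has no TaxID
    return tid if tid in tid2gs else None


# Made-up lookup tables for illustration
otu2accn_demo = {'10': 'AB000001', '11': 'AB000002', '12': 'AB000003'}
accn2tid_demo = {'AB000001': '562', 'AB000002': '777'}
tid2gs_demo = {'562': {'G1', 'G2'}}

otu2tid_demo = {}
for otu in ['10', '11', '12', '13']:
    tid = otu_to_taxid(otu, otu2accn_demo, accn2tid_demo, tid2gs_demo)
    if tid is not None:
        otu2tid_demo[otu] = tid

print(otu2tid_demo)
```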


In [52]:
tid2otus = {}
for otu, tid in otu2tid.items():
    if tid not in tid2otus:
        tid2otus[tid] = set([otu])
    else:
        tid2otus[tid].add(otu)
print('Number of unique TaxIDs represented by tips: %d' % len(tid2otus))

Number of unique TaxIDs represented by tips: 2234


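Assess the cladistic properties of those TaxIDs. A TaxID with one tip is recorded as `single`; one whose tips are exactly the descendants of their lowest common ancestor is recorded as `monophyly`; every other case is recorded as `paraphyly`.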

In [53]:
cladistics = {'single': [], 'monophyly': [], 'paraphyly': []}
for tid, otus in sorted(tid2otus.items()):
    tips = [tree.find(x) for x in otus]
    if len(tips) == 1:
        cladistics['single'].append(tid)
    elif tree.lca(tips).subset() == otus:
        cladistics['monophyly'].append(tid)
    else:
        cladistics['paraphyly'].append(tid)
print('There are %d single tips, %d monophyletic groups and %d paraphyletic groups among those TaxIDs.'
      % (len(cladistics['single']), len(cladistics['monophyly']), len(cladistics['paraphyly'])))

There are 1827 single tips, 229 monophyletic groups and 178 paraphyletic groups among those TaxIDs.
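
The monophyly test compares the tip set under the lowest common ancestor with the set of OTUs assigned to the TaxID. The sketch below reproduces that logic on a toy tree written as nested tuples, without scikit-bio. All function names here are hypothetical, and the sketch assumes every target tip is present in the tree.

```python
def tip_set(node):
    """Return the set of tip names under a node (a tip is a plain string)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tip_set(child) for child in node))


def smallest_clade(node, targets):
    """Return the tip set of the smallest clade that contains all targets."""
    if not isinstance(node, str):
        for child in node:
            if targets <= tip_set(child):
                # All targets sit inside this child, so descend into it
                return smallest_clade(child, targets)
    return tip_set(node)


def classify(tree, targets):
    """Label a group of tips the same way as the cell above."""
    if len(targets) == 1:
        return 'single'
    if smallest_clade(tree, targets) == targets:
        return 'monophyly'
    return 'paraphyly'


toy = ((('a', 'b'), 'c'), ('d', 'e'))
print(classify(toy, {'a', 'b'}))   # clade (a, b) holds exactly the targets
print(classify(toy, {'a', 'c'}))   # smallest clade also contains b
print(classify(toy, {'d'}))
```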


In [54]:
with open('cladistics.tsv', 'w') as f:
    for cat in cladistics:
        for tid in sorted(cladistics[cat], key=int):
            f.write('%s\t%s\n' % (tid, cat))

Translate OTU IDs to TaxIDs at the tips.

In [55]:
# for single tips, translate OTU into TaxID
# (each of these TaxIDs has exactly one OTU, so max() just extracts it)
for tid in cladistics['single']:
    tree.find(max(tid2otus[tid])).name = 'TaxID%s' % tid

# for monophyletic groups, collapse the entire clade into one taxon
nodes_to_remove = set()
for tid in cladistics['monophyly']:
    lca = tree.lca([tree.find(x) for x in tid2otus[tid]])
    # attach a new tip named after the TaxID next to the clade being removed
    lca.parent.append(TreeNode('TaxID%s' % tid, lca.descending_branch_length()))
    nodes_to_remove.add(lca)
tree.remove_deleted(lambda x: x in nodes_to_remove)

# delete all other tips
mapped_tips = [x.name for x in tree.tips() if x.name.startswith('TaxID')]
tree = tree.shear(mapped_tips)
for tip in tree.tips():
    tip.name = tip.name[5:]

print('After mapping OTUs to TaxIDs, there are %d tips in the tree.'
      % tree.count(tips=True))

After mapping OTUs to TaxIDs, there are 2056 tips in the tree.


In [56]:
tree.write('gg.tol.tid.nwk')

'gg.tol.tid.nwk'

Translate TaxIDs to genome IDs at the tips.

In [57]:
tips_to_remove = set()
for tip in list(tree.tips()):
    tip.parent.extend([TreeNode(x) for x in tid2gs[tip.name]])
    tips_to_remove.add(tip)
tree.remove_deleted(lambda x: x in tips_to_remove)
print('After mapping TaxIDs to genome IDs, there are %d tips in the tree.'
      % tree.count(tips=True))

After mapping TaxIDs to genome IDs, there are 2286 tips in the tree.
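
Each TaxID tip is replaced by one sibling tip per genome assigned to that TaxID, so the new tip count is the sum of the genome counts over the TaxID tips. A minimal sketch of that bookkeeping, with hypothetical TaxIDs and genome IDs:

```python
# Hypothetical TaxID tips and their genomes
tid_tips = ['562', '1280', '1423']
tid2genomes = {
    '562': {'G1', 'G2'},
    '1280': {'G3'},
    '1423': {'G4', 'G5', 'G6'},
}

# One genome tip per genome of each TaxID tip
genome_tips = sorted(g for tid in tid_tips for g in tid2genomes[tid])
expected = sum(len(tid2genomes[tid]) for tid in tid_tips)

print('%d TaxID tips become %d genome tips' % (len(tid_tips), len(genome_tips)))
```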


In [58]:
tree.write('gg.tol.gid.nwk')

'gg.tol.gid.nwk'