# Treemap

----------------

This Jupyter notebook walks through the steps of creating a treemap for displaying hierarchical information. In its current form it shows how reads from a metagenomic study are distributed across the different branches of living organisms.

*written by Gytis Dudas, 2019*

## Step 1: process and prepare archived NCBI taxonomy file (optional)

NCBI updated its virus taxonomy over the course of this project based on ICTV proposals, but the change has been incomplete. For example, most sequences under `ssRNA negative-strand viruses` (a descriptive category) have been moved to `Negarnaviricota` (a genealogical category), yet >100 accessions that are distinctly negative-sense RNA viruses still remain under the old category. For the sake of consistency we have decided to go with the older version of the taxonomy.

This cell downloads the required version of the taxonomy, unzips it and repacks it as a gzipped tarball (`.tar.gz`), which is the only format `ete3` will take.

In [1]:
# %%bash

# store_folder=/Users/evogytis/Downloads
# cd $store_folder

# # tax_db="taxdmp_2019-01-01" ## latest taxonomy release that still contains original virus taxonomy
# tax_db="taxdmp_2019-12-01" ## latest taxonomy release
# curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/$tax_db.zip ## download taxonomy file

# rm -rf $store_folder/$tax_db ## remove existing directory
# mkdir $store_folder/$tax_db

# unzip -o $store_folder/$tax_db.zip -d $store_folder/$tax_db

# cd $store_folder/$tax_db; tar -czvf $store_folder/$tax_db.tar.gz *.dmp *.prt *.txt
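
The repacking step can also be done from Python. The sketch below is a minimal stand-in for the unzip and tar+gzip part of the bash cell, using only the standard library. `repack_taxdump` and its arguments are hypothetical names, and it assumes the zip archive has already been downloaded.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path

def repack_taxdump(zip_path, targz_path, suffixes=('.dmp', '.prt', '.txt')):
    """Unpack a taxdump zip and repack the matching files as a flat .tar.gz (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)  # unzip into a scratch directory
        with tarfile.open(targz_path, 'w:gz') as tar:
            for f in sorted(Path(tmp).iterdir()):
                if f.suffix in suffixes:
                    tar.add(str(f), arcname=f.name)  # keep files at the archive root
    return targz_path
```

The resulting path could then be passed as `taxdump_file` when updating `ete3`'s taxonomy in the next step.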

## Step 2: load libraries, update ete3's taxonomy to file provided

This cell loads standard-library modules (`os`, `json`, `glob`, and `collections`) and `ete3`. `ete3` is required to place all BLAST hits in the treemap via its `get_lineage` method. Updating the taxonomy database takes ~3 minutes.

Also loaded are a CD-HIT file of clustered contigs and per-sample files mapping the number of reads to each contig.

In [2]:
import ete3
import os,json,glob

# taxonomy_path='/Users/evogytis/Downloads/taxdmp_2019-01-01.tar.gz'
taxonomy_path='/Users/evogytis/Downloads/taxdmp_2019-12-01.tar.gz'
base_path='/Users/evogytis/Documents/manuscripts/skeeters/data' ## point to where the repo's data folder is stored locally


ncbi=ete3.ncbi_taxonomy.NCBITaxa()
# ncbi.update_taxonomy_database(taxdump_file=taxonomy_path) ## uncomment to update ete3's taxonomy

#########
## Load clusters of homologous contigs >500 nt
#########
from collections import defaultdict, namedtuple
Member = namedtuple('Member', ['contig', 'length', 'percent_id', 'percent_id_sign', 'sample', 'coverage'])

def parse_cdhit_row(row):
    if '*' in row:
        index, length, name, percent_id = row.split()
        percent_id_sign, percent_id = '0', 100
    else:
        index, length, name, _, percent_id = row.split()
    length = int(length.strip(',nt'))
    name = name.strip('>').strip('.')
    sample, contig = name.split('~')
    coverage = float(contig.split('_')[-1])
    
    if percent_id != 100:
        percent_id_sign, percent_id = percent_id.strip('%').split('/')
        percent_id = float(percent_id)
    return Member(contig=contig, sample=sample, length=length,
                  percent_id=percent_id, percent_id_sign=percent_id_sign, coverage=coverage)

clusters = defaultdict(list)
with open(base_path+'/500_contigs_cluster.clstr', 'r') as file:
    for line in file:
        if line.startswith('>Cluster'):
            cluster_id = line.split()[-1]
        else:
            member = parse_cdhit_row(line)
            if 'water' in member.sample.lower():
                continue
            clusters[cluster_id].append(member)
            
contig2cluster={}
for cluster_id in clusters:
    for m in clusters[cluster_id]:
        contig2cluster[m.contig]=cluster_id

#########
## Load read counts of each contig
#########
contig_reads={} ## will contain a mapping of sample: contig ID: number of reads
for reads_file in glob.glob(os.path.join(base_path,'s3/contig_quality/*/contig_stats_all.tsv')): ## iterate over per-sample contig stats files
    fsample=os.path.basename(os.path.dirname(reads_file))
    if 'water' not in fsample.lower():
        for line in open(reads_file,'r'): ## iterate over lines
            l=line.strip('\n').split('\t')
            if l[0]=='sample':
                header={x:i for i,x in enumerate(l)} ## create header dict
            else:
                sample=l[header['sample']]
                contig_name=l[header['contig_name']] ## get contig name
                read_count=int(float(l[header['read_count']])) ## get read count

                assert sample==fsample ## sanity check that sample name in file is same as folder name
                if sample not in contig_reads:
                    contig_reads[sample]={}

                contig_reads[sample][contig_name]=read_count
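
For orientation, here is roughly what the two kinds of CD-HIT `.clstr` rows look like and which fields come out of them. This is only a sketch: it uses a regular expression instead of the `split`-based parser in the cell above, and the sample and contig names in `rows` are made up for illustration.

```python
import re

# One representative row ('*') and one member row; names are hypothetical.
rows = [
    '0\t1523nt, >CMS001_001~NODE_1_length_1523_cov_12.5... *',
    '1\t800nt, >CMS001_002~NODE_7_length_800_cov_3.25... at +/98.50%',
]

pattern = re.compile(
    r'^\d+\s+(?P<length>\d+)nt, >(?P<sample>[^~]+)~(?P<contig>\S+?)\.\.\. '
    r'(?:\*|at (?P<sign>[+-])/(?P<pid>[\d.]+)%)$'
)

def parse_row(row):
    """Turn one cluster member row into a dict of its fields."""
    m = pattern.match(row.strip())
    if m is None:
        raise ValueError('unrecognised CD-HIT row: %r' % row)
    is_representative = m.group('pid') is None  # '*' rows carry no identity value
    return {
        'sample': m.group('sample'),
        'contig': m.group('contig'),
        'length': int(m.group('length')),
        'percent_id': 100.0 if is_representative else float(m.group('pid')),
        'coverage': float(m.group('contig').rsplit('_', 1)[-1]),  # coverage is the last '_' field
        'representative': is_representative,
    }

parsed = [parse_row(r) for r in rows]
for p in parsed:
    print(p)
```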

## Step 3: load pre-defined taxa that will be displayed in the treemap
-----

The following cell loads a file (`displayed_taxa_reads.json`) that looks like this:

```
[{"taxid": 1,
"taxonomy": "root"},
{"taxid": 10239,
"taxonomy": "Viruses"},
{"taxid": 131567,
"taxonomy": "cellular organisms"},
{"taxid": 2157,
"taxonomy": "Archaea"},
{"taxid": 2759,
"taxonomy": "Eukaryota"},
{"taxid": 2,
"taxonomy": "Bacteria"}]
```

It is loaded as a flat list of branches, which get annotated later (here with contig and read counts) and in the last steps built into a nested tree data structure (_i.e._ here it would be `(Root(Viruses,CellularOrganisms(Eukaryota,Bacteria,Archaea)));`).

By default the produced treemap will only contain the taxids listed in this file. Any other taxid is traversed back towards the root until a taxid that does exist in the file is reached, _e.g._ _Metazoa_ would be traversed back to _Eukaryota_.
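
The traversal can be sketched in a few lines. The lineage below is hard-coded so that the example runs without a taxonomy database; in the notebook the equivalent list comes from `ncbi.get_lineage()`. `nearest_displayed` is a hypothetical helper name.

```python
displayed = {1, 10239, 131567, 2157, 2759, 2}  # taxids from the example file above

# Root-to-tip lineage for Metazoa (33208), written out by hand for illustration.
metazoa_lineage = [1, 131567, 2759, 33154, 33208]

def nearest_displayed(lineage, displayed):
    """Walk from the tip towards the root; return the first taxid that is a treemap compartment."""
    for taxid in reversed(lineage):
        if taxid in displayed:
            return taxid
    return None  # nothing along the lineage is displayed

print(nearest_displayed(metazoa_lineage, displayed))  # 2759, i.e. Eukaryota
```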

At the end of the cell additional branches are added to the tree. These are high-order taxonomic lineages that are currently not listed by NCBI (_e.g._ the Narna-Levi supergroup that links _Narnaviridae_ and _Leviviridae_). The supergroups are additionally linked into an even higher-order structure based on strandedness (_i.e._ Baltimore class), which can be paraphyletic but is used here to introduce more structure into the treemap.
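
To illustrate how such a custom lineage replaces the upper part of an NCBI lineage, here is a small sketch. The `ncbi_lineage` list is a placeholder rather than a real lineage, and `splice` is a hypothetical helper that searches from the tip, which is a slight simplification of the loop used in Step 5.

```python
# Custom lineages keyed by the taxid they end at, in the same shape as the notebook's high_order_lineage.
high_order_lineage = {
    186766: [1, 10239, '(+)ssRNA viruses', 'Narna-Levi', 186766],  # Narnaviridae
}

# Placeholder root-to-tip lineage of a virus that sits under Narnaviridae.
ncbi_lineage = [1, 10239, 'intermediate rank', 186766, 'hypothetical narnavirus']

def splice(lineage, custom):
    """Replace everything up to the deepest custom-lineage taxid with the custom lineage."""
    for i in range(len(lineage) - 1, -1, -1):  # search from the tip so the most specific match wins
        if lineage[i] in custom:
            return custom[lineage[i]] + lineage[i + 1:]
    return lineage  # no custom lineage applies

result = splice(ncbi_lineage, high_order_lineage)
print(result)
```

The virus keeps its position below the family, but the family now hangs off the supergroup and strandedness compartments instead of whatever ranks NCBI lists above it.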

In [3]:
J=json.load(open(os.path.join(base_path,'../treemap/displayed_taxa_reads.json'),'r')) ## load designated treemap branches
branches={b['taxid']:b for b in J} ## flat list of branches indexed by taxid

remove_branches=[] ## empty list that will contain taxids whose lineage cannot be recovered by ete3 (because of out-dated taxonomy)
for taxid in branches: ## iterate over branches loaded so far
    try:
        ncbi.get_lineage(taxid) ## attempt to get lineage
    except ValueError: ## attempt failed
        remove_branches.append(taxid) ## remember taxid for removal later
        print('taxid %s not in taxdump.tar.gz file loaded earlier, it will be excluded'%(taxid))
        
for taxid in remove_branches:
    branches.pop(taxid) ## remove taxids that failed
        
branches['uncurated']={'taxonomy':'uncurated','taxid':'uncurated','attrs':{'colour': '#E7E7E6'}} ## also create a branch that collects reads from viral hits outside the curated viruses (filled in Step 4)

for b in branches: ## assign default colour to branches
    branches[b]['attrs']={'colour':'#E7E7E6'} ## default colour is light grey, but a later cell assigns a colour based on descent from Bacteria, Eukaryotes or Viruses
    
print('current taxids that exist as branches: %s\n'%(branches.keys()))

###########
## This code will add additional treemap compartments that group viral families into higher-order structures which don't exist in official NCBI taxonomy
###########
fam_taxids={249184: 'Tymoviridae',675071: 'Virgaviridae', 
            2560063: 'Botourmiaviridae', 186766: 'Narnaviridae', 11989: 'Leviviridae',
            119163: 'Luteoviridae', 2169577: 'Solemoviridae', 
            39738: 'Tombusviridae', 12283: 'Nodaviridae', 
            11006: 'Totiviridae', 249310: 'Chrysoviridae', 
            11050: 'Flaviviridae', 10880: 'Reoviridae', 
            11012: 'Partitiviridae', 464095: 'Picornavirales'} ## link taxids to family names

## there are 12 supergroups, but Partitis, Reos and Flavis are their own supergroups, whereas Picornas, Mononegs, Orthomyxos, and Bunyas exist as supergroup taxids on NCBI
supergroups={'Tymoviridae': 'Hepe-Virga',
             'Virgaviridae': 'Hepe-Virga',
             'Botourmiaviridae': 'Narna-Levi', 
             'Narnaviridae': 'Narna-Levi', 
             'Leviviridae': 'Narna-Levi',
             'Luteoviridae': 'Luteo-Sobemo', 
             'Solemoviridae': 'Luteo-Sobemo', 
             'Tombusviridae': 'Tombus-Noda', 
             'Nodaviridae': 'Tombus-Noda', 
             'Totiviridae': 'Toti-Chryso', 
             'Chrysoviridae': 'Toti-Chryso'}

## adds an extra layer to treemap which is strandedness
strands={10880: 'dsRNA viruses', ## Reo
         'Toti-Chryso': 'dsRNA viruses', 
         11012: 'dsRNA viruses', ## Partiti
         'Luteo-Sobemo': '(+)ssRNA viruses', 
         'Tombus-Noda': '(+)ssRNA viruses', 
         'Narna-Levi': '(+)ssRNA viruses', 
         11050: '(+)ssRNA viruses', ## Flavi
         'Hepe-Virga': '(+)ssRNA viruses', 
         464095: '(+)ssRNA viruses'} ## Picorna

high_order_lineage={} ## will contain new high-order taxid lineages (adds supergroup classification)

for fam in fam_taxids: ## iterate over family taxids
    if fam_taxids[fam] in supergroups: ## family is part of supergroup
        print('family %s (%s supergroup)'%(fam_taxids[fam],supergroups[fam_taxids[fam]]))
        sg=supergroups[fam_taxids[fam]] ## get supergroup of family
        st=strands[sg] ## get strand of supergroup
        high_order_lineage[fam]=[1,10239,st,sg,fam] ## assign new family lineage with supergroup and strand
    else:
        sg=fam ## family is its own supergroup
        print('supergroup family %s'%(fam_taxids[fam]))
        st=strands[sg] ## get strand of supergroup
    
    if st not in branches: ## create new branches
        branches[st]={'attrs':{'colour': '#E7E7E6'},'taxid': st, 'taxonomy': st}
    if sg not in branches:
        branches[sg]={'attrs':{'colour': '#E7E7E6'},'taxid': sg, 'taxonomy': sg}
    
    high_order_lineage[sg]=[1,10239,st,sg] ## assign strand+supergroup lineage
    high_order_lineage[st]=[1,10239,st] ## assign strand lineage
    
high_order_lineage['uncurated']=[1,10239,'uncurated'] ## uncurated contig reads will sit under Viruses

current taxids that exist as branches: dict_keys([1, 131567, 2, 953, 1236, 28211, 203691, 91347, 135619, 1783272, 4751, 2759, 1437010, 35500, 33554, 9989, 9979, 8782, 9126, 5654, 1286322, 5690, 1206794, 33213, 6029, 451864, 6231, 5794, 33154, 7711, 10239, 2497569, 11157, 11308, 1980410, 1980417, 1980418, 1980416, 1299308, 11270, 2501985, 186766, 11989, 2560063, 2169577, 119163, 11012, 11050, 39738, 12283, 10880, 11006, 249310, 464095, 699189, 232795, 249184, 675071, 'uncurated'])

family Tymoviridae (Hepe-Virga supergroup)
family Virgaviridae (Hepe-Virga supergroup)
family Botourmiaviridae (Narna-Levi supergroup)
family Narnaviridae (Narna-Levi supergroup)
family Leviviridae (Narna-Levi supergroup)
family Luteoviridae (Luteo-Sobemo supergroup)
family Solemoviridae (Luteo-Sobemo supergroup)
family Tombusviridae (Tombus-Noda supergroup)
family Nodaviridae (Tombus-Noda supergroup)
family Totiviridae (Toti-Chryso supergroup)
family Chrysoviridae (Toti-Chryso supergroup)
supergroup family F

## Step 4: annotation of branches
------

In this cell we add the information about how many reads (and contigs) belong to each taxonomic compartment. The `summarise` option allows you to choose between relying on the treemap compartments specified in the `displayed_taxa_reads.json` file or creating a new treemap compartment for every new taxid seen in the BLAST results summary file.

If `backbone` is chosen every taxid in the BLAST output summary is traversed back towards the root and all reads get assigned to the first treemap compartment encountered, _e.g._ if the BLAST hit was _Metazoa_ all the reads mapping to the contig would be assigned to _Eukaryota_, which is the most specific treemap compartment available.

If `all` is chosen a contig that's been assigned to _Metazoa_ will create a new treemap compartment with all the reads from the contig assigned to it. Future contigs assigned to _Metazoa_ will contribute their reads to this compartment.
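
The difference between the two modes can be seen on a toy example. The sketch below is not the notebook's code: `assign_reads` is a hypothetical helper, the lineage is hard-coded, the sample names and read numbers are invented, and the special handling of viral hits (which the cell below sends to the `uncurated` compartment) is left out.

```python
from collections import defaultdict

def assign_reads(hits, lineages, displayed, summarise='backbone'):
    """Return {compartment: {sample: reads}} for an iterable of (taxid, sample, reads) hits."""
    compartments = set(displayed)
    counts = defaultdict(lambda: defaultdict(int))
    for taxid, sample, reads in hits:
        if summarise == 'all':
            compartments.add(taxid)  # every taxid seen becomes its own compartment
        for rank in reversed(lineages[taxid]):  # walk from the tip towards the root
            if rank in compartments:
                counts[rank][sample] += reads
                break
    return {k: dict(v) for k, v in counts.items()}

lineages = {33208: [1, 131567, 2759, 33154, 33208]}  # Metazoa, written out by hand
hits = [(33208, 'sampleA', 10), (33208, 'sampleA', 5), (33208, 'sampleB', 7)]
displayed = {1, 131567, 2759}

backbone = assign_reads(hits, lineages, displayed, 'backbone')
everything = assign_reads(hits, lineages, displayed, 'all')
print(backbone)    # {2759: {'sampleA': 15, 'sampleB': 7}}
print(everything)  # {33208: {'sampleA': 15, 'sampleB': 7}}
```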

In [4]:
summarise='backbone'
# summarise='all'

in_json=open(os.path.join(base_path,'darkmatter/virus.json'),'r')
virus=json.load(in_json)

redirect={} ## will redirect some viruses from the taxids they were submitted under (usually under "unclassified Viruses") to something more informative

viral_reads={}
for line in open(os.path.join(base_path,'s3/contig_quality_concat/viral_decontam.tsv'),'r'):
    l=line.strip('\n').split('\t')
#     print(l)
    if l[0]=='sample':
        header={x: i for i,x in enumerate(l)}
    else:
        pol_group=l[header['poly_group']]
        sample=l[header['sample']]
        reads=int(l[header['reads']])
        
        if pol_group not in viral_reads:
            viral_reads[pol_group]={}
        if sample not in viral_reads[pol_group]:
            viral_reads[pol_group][sample]=0
            
        viral_reads[pol_group][sample]+=reads ## store reads for virus+sample


for pol_group in virus: ## iterate over polymerase groups
    taxid=virus[pol_group]['taxid'] ## get assigned taxid to virus
    subm_taxid=virus[pol_group]['submission_taxid'] ## get the taxid under which the virus will be submitted (may belong to a group sitting under "unclassified Viruses")
    
    name=virus[pol_group]['provisional_name'] if 'provisional_name' in virus[pol_group] else virus[pol_group]['name'] ## get the official or provisional name of the virus (human-readable)
    
    if taxid!=subm_taxid: ## assigned taxid does not match submission taxid (previously described viruses under incorrect taxid in the database)
        print('previously described virus %s with official taxid %s will be redirected as child of %s'%(name,subm_taxid,taxid))
        redirect[subm_taxid]=taxid ## add taxid for redirection later
        vir_id=subm_taxid ## virus ID is the incorrect taxid
    elif 'provisional_name' in virus[pol_group]: ## new virus, still needs redirect
        redirect[name]=taxid ## since new viruses don't exist under their own taxid it will need to be created later
        vir_id=name
    else: ## taxid matches assigned taxid, not a new virus and sits at the appropriate taxid
        vir_id=taxid
    
    branches[vir_id]={'attrs':{'colour': '#E7E7E6'}} ## create a new branch for treemap
    branches[vir_id]['taxonomy']=name ## give it a name
    branches[vir_id]['taxid']=vir_id ## give whatever was closest to a taxid (provisional name for new viruses)
    
    for sample in viral_reads[pol_group]:
        branches[vir_id]['attrs'][sample]={'read_count': viral_reads[pol_group][sample]} ## assign reads to virus
    
    
nonviral_reads={}
for line in open(os.path.join(base_path,'s3/contig_quality_concat/lca_decontam.tsv'),'r'):
    l=line.strip('\n').split('\t')
    
    if l[0]=='sample':
        header={x: i for i,x in enumerate(l)}
    else:
        taxid=int(l[header['taxid']])
        sample=l[header['sample']]
        reads=int(l[header['reads']])
        
        try:
            lineage=ncbi.get_lineage(taxid)
            
            if summarise=='all': ## adding every taxid
                if taxid not in branches:
                    branches[taxid]={'attrs':{'colour': '#E7E7E6'}} ## create a new branch for treemap
                    branches[taxid]['taxonomy']=ncbi.get_taxid_translator([taxid])[taxid] ## give it a name
                    branches[taxid]['taxid']=taxid ## assign taxid
            
            if 10239 in lineage: ## viral lineage
                if sample not in branches['uncurated']['attrs']:
                    branches['uncurated']['attrs'][sample]={'read_count': 0}
                branches['uncurated']['attrs'][sample]['read_count']+=reads ## assign reads to uncurated part
            else:
                for rank in lineage[::-1]:
                    if rank in branches:
                        branch=branches[rank]
                        if sample not in branch['attrs']:
                            branch['attrs'][sample]={'read_count': 0}
                        branch['attrs'][sample]['read_count']+=reads
                        break
        except ValueError:
            print('taxid %s not available'%(taxid))

previously described virus Culex pipiens-associated Tunisia virus with official taxid 2079148 will be redirected as child of 1527522
previously described virus Hubei chryso-like virus 1 with official taxid 1922855 will be redirected as child of 2587491
previously described virus Culex bunya-like virus with official taxid 2304497 will be redirected as child of 39718
previously described virus Guadeloupe Culex tymo-like virus with official taxid 2607736 will be redirected as child of 330383
previously described virus Guadeloupe mosquito virus with official taxid 2607735 will be redirected as child of 2600328
previously described virus Wenzhou sobemo-like virus 4 with official taxid 1923660 will be redirected as child of 502177
previously described virus Wuhan insect virus 33 with official taxid 1923736 will be redirected as child of 336635
previously described virus Hubei virga-like virus 2 with official taxid 1923335 will be redirected as child of 1527522




## Step 5: build tree structure, output to file
-------

All the branches have been annotated up to this stage and now it's time to build the tree data structure. We start by summarising the read counts across samples (contig counts are currently commented out in the code). We then specify colours for particular compartments and their descendants (_e.g._ -ssRNA viruses obviously need to be red-ish). The procedure involves fetching branch A, retrieving its lineage and looking for taxids along its lineage which exist as a treemap compartment (let's call it B). Treemap compartment A gets assigned as a child of treemap compartment B.
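
The parent-finding procedure can be sketched on the six branches from the Step 3 example. The lineages are hard-coded here instead of being fetched with `ete3`, and `build_tree` and `to_string` are hypothetical helpers. Unlike the cell below, this sketch treats any branch without a displayed ancestor as a root.

```python
def build_tree(branches, lineages):
    """Attach each branch to its closest ancestor that is also a branch; return the list of roots."""
    roots = []
    for taxid, node in branches.items():
        ancestors = lineages[taxid][:-1]  # root-to-tip lineage without the branch itself
        for anc in reversed(ancestors):  # closest ancestor first
            if anc in branches:
                branches[anc].setdefault('children', []).append(node)
                break
        else:
            roots.append(node)  # no displayed ancestor was found
    return roots

def to_string(node):
    """Render a node and its descendants as a nested, Newick-like string."""
    kids = node.get('children', [])
    inner = '(%s)' % ','.join(to_string(k) for k in kids) if kids else ''
    return node['taxonomy'] + inner

branches = {t: {'taxid': t, 'taxonomy': n} for t, n in
            [(1, 'root'), (10239, 'Viruses'), (131567, 'cellular organisms'),
             (2157, 'Archaea'), (2759, 'Eukaryota'), (2, 'Bacteria')]}
lineages = {1: [1], 10239: [1, 10239], 131567: [1, 131567],
            2157: [1, 131567, 2157], 2759: [1, 131567, 2759], 2: [1, 131567, 2]}

roots = build_tree(branches, lineages)
print(to_string(roots[0]))  # root(Viruses,cellular organisms(Archaea,Eukaryota,Bacteria))
```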

The resulting tree structure is saved to file as JSON and can be inspected using the JavaScript code included in the repository: run a local server with the command below and open `treemap.html`:
```
python3 -m http.server 4000
```
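
The nested JSON can also be checked quickly from Python without the browser. The sketch below walks a tree of the same shape and prints an indented outline. `outline` is a hypothetical helper and the `toy` tree with its read counts is invented for illustration; to inspect the real output one would pass the result of `json.load` on the written file instead.

```python
def outline(node, depth=0, lines=None):
    """Collect one indented 'name: read_count' line per named node, depth-first."""
    lines = [] if lines is None else lines
    if 'taxonomy' in node:  # the top-level container has no name of its own
        reads = node.get('attrs', {}).get('read_count', 0)
        lines.append('%s%s: %d' % ('  ' * depth, node['taxonomy'], reads))
        depth += 1
    for child in node.get('children', []):
        outline(child, depth, lines)
    return lines

# Toy tree in the same shape as the notebook's taxa_tree (values are made up).
toy = {'attrs': {'colour': '#E7E7E6'}, 'children': [
    {'taxid': 1, 'taxonomy': 'root', 'attrs': {'read_count': 3}, 'children': [
        {'taxid': 10239, 'taxonomy': 'Viruses', 'attrs': {'read_count': 5}}]}]}

print('\n'.join(outline(toy)))
```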

In [5]:
taxa_tree={'children':[],'attrs':{'colour':'#E7E7E6'}} ## top-level container of the tree structure; the root branch is attached to it below

sorted_taxids=branches

for b in sorted_taxids: ## iterate through flat list of branches
    reads=[branches[b]['attrs'][c]['read_count'] for c in branches[b]['attrs'] if 'CMS' in c]
#     contigs=[branches[b]['attrs'][c]['contig_count'] for c in branches[b]['attrs'] if 'CMS' in c]
    branches[b]['attrs']['read_count']=sum(reads)
#     branches[b]['attrs']['contig_count']=sum(contigs) ## compute sum of contig counts across samples  
    branches[b]['attrs']['sample_count']=len(reads)

## Colours assigned to a treemap compartment (all descendants inherit parental colour unless they have their own assigned colour)
lineage_colours={'dsRNA viruses': 'grey', ## dsRNA
                 '(+)ssRNA viruses': 'grey',  ## (+)ssRNA
                 2497569: 'grey', ## (-)ssRNA
                 10239: 'grey', ## viruses
                 11157: '#FF7D48', ## Mononegas
                 1980410: '#AC6569', ## Bunyas
                 11308: '#C43A3F', ## Orthomyxos
                 11050: '#0074C4', ## Flavi 
                 464095: '#42C754', ## Picorna
                 11012: '#965DAE', ## Partiti
                 10880: '#866E8B', ## Reo
                 'Hepe-Virga': '#AFD1FF', ## Hepe-Virga
                 'Narna-Levi': '#6967BC', ## Narna-Levi
                 'Luteo-Sobemo': '#E6C930', ## Luteo-Sobemo
                 'Toti-Chryso': '#D8A5CE', ## Toti-Chryso
                 'Tombus-Noda': '#618F80', ## Tombus-Noda
                 'uncurated': '#ACADAE', ## uncurated viral contigs
                 2: '#867A5F', ## Bacteria
                 953: '#86665F', ## Wolbachia
                 2759: '#E5E4E2', ## Eukaryota
                 5794:'#5E716A', ## apicomplexa
                 7711: '#B2BEB5', ## chordata
                 8782: '#6082B6', ## Aves
                 9347: '#658EA9', ## Eutheria
                 4751: '#808080', ## Fungi
                 5654: '#708090' ## Trypanosomatidae
                }

for taxid in sorted_taxids: ## iterate over every taxon in treemap
    taxon=branches[taxid] ## fetch branch
    
    if taxid in redirect: ## taxid has been redirected because the official taxid is imprecise
        lineage=ncbi.get_lineage(redirect[taxid])+[taxid]
    elif taxid in high_order_lineage: ## taxid is being adjusted at a higher taxonomic level
        lineage=high_order_lineage[taxid]
    else:
        lineage=ncbi.get_lineage(taxid) ## taxid is correct as assigned

    for r,rank in enumerate(lineage): ## iterate over lineage recovered
        if rank in high_order_lineage: ## lineage contains high-order readjustment
            lineage=high_order_lineage[rank]+lineage[r+1:] ## insert new lineage

    for rank in lineage: ## iterate over (potentially) new lineage
        if rank in branches and rank in lineage_colours: ## lineage exists and has colour assignment
            taxon['attrs']['colour']=lineage_colours[rank] ## assign colour

    if len(lineage)==1 and taxon not in taxa_tree['children']: ## if root
        taxa_tree['children'].append(taxon) ## add root to tree

    for lin in lineage[::-1][1:]: ## iterate through lineage, starting from most recent, ignore first entry (self)
        if lin in branches: ## rank present amongst branches
            parent=branches[lin] ## grab parent

            if 'children' not in parent: ## if parent doesn't have children yet - add the attribute
                parent['children']=[]

            if taxon not in parent['children']: ## branch wasn't assigned to its parent yet
                parent['children'].append(taxon) ## add child to parent

            assert taxon['taxid']!=parent['taxid'], 'parent is child %s %s, lineage: %s'%(taxon['taxid'],taxon['taxonomy'],lineage)
            break ## break for loop, parent has been identified
    
with open(os.path.join(base_path,'../treemap/skeeters.json'),'w') as out_json: ## write json out to repo
    json.dump(taxa_tree,out_json,indent=1,sort_keys=True)

print('Done!')

Done!


In [6]:
def table_treemap(node,samples,stat,file=None):
    if 'taxid' in node:
        sample_reads=[node['attrs'][s][stat] if s in node['attrs'] else 0 for s in samples]
        if sum(sample_reads)>0:
            row='%s\t%s\t%s'%(node['taxid'],node['taxonomy'],'\t'.join(map(str,sample_reads)))
            if file is None:
                print(row)
            else:
                file.write('%s\n'%(row))
        
    if 'children' in node:
        for child in node['children']:
            table_treemap(child,samples,stat,file=file)

samples=sorted(set(sum([[attr for attr in branches[txid]['attrs'] if 'CMS' in attr] for txid in branches],[]))) ## sorted so that row values line up with the header columns

header='taxid\ttaxonomy\t%s\n'%('\t'.join(sorted(samples)))
table_out=open(os.path.join(base_path,'darkmatter/Table_S3.tsv'),'w')
table_out.write(header)
# table_out=None

table_treemap(taxa_tree,samples,'read_count',table_out)

table_out.close()

print('Done!')

Done!


In [7]:
output_colours=open(os.path.join(base_path,'../figures/fig3/virus_color_scheme.tsv'),'w')

for pol_group in sorted(virus,key=lambda k: (virus[k]['family'],k)):
    family=virus[pol_group]['family']
    
    taxid=virus[pol_group]['taxid'] ## get assigned taxid to virus
    subm_taxid=virus[pol_group]['submission_taxid'] ## get the taxid under which the virus will be submitted (may belong to a group sitting under "unclassified Viruses")
    
    name=virus[pol_group]['provisional_name'] if 'provisional_name' in virus[pol_group] else virus[pol_group]['name'] ## get the official or provisional name of the virus (human-readable)
    
    if taxid!=subm_taxid: ## assigned taxid does not match submission taxid (previously described viruses under incorrect taxid in the database)
        vir_id=subm_taxid ## virus ID is the incorrect taxid
    elif 'provisional_name' in virus[pol_group]: ## new virus, still needs redirect
        vir_id=name
    else: ## taxid matches assigned taxid, not a new virus and sits at the appropriate taxid
        vir_id=taxid
    
    lineage=None
    if vir_id in redirect: ## taxid has been redirected because the official taxid is imprecise
        lineage=ncbi.get_lineage(redirect[vir_id])+[vir_id]
    elif vir_id in high_order_lineage: ## taxid is being adjusted at a higher taxonomic level
        lineage=high_order_lineage[vir_id]
    else:
        lineage=ncbi.get_lineage(vir_id) ## taxid is correct as assigned
    
    for r,rank in enumerate(lineage): ## iterate over lineage recovered
        if rank in high_order_lineage: ## lineage contains high-order readjustment
            lineage=high_order_lineage[rank]+lineage[r+1:] ## insert new lineage
    
    for rank in lineage[::-1]:
        if rank in lineage_colours:
            output_colours.write('%s\t%s\t%s\n'%(pol_group,family,lineage_colours[rank]))
            break
            
output_colours.close()

In [8]:
for b in branches:
    reads=branches[b]['attrs']['read_count']
    if reads>20000:
        print(branches[b]['taxonomy'],reads)

root 114914
cellular organisms 276015
Bacteria 116465
Wolbachia 293165
Spirochaetes 22453
Enterobacterales 71125
Oceanospirillales 26126
Terrabacteria group 47838
Eukaryota 63070
Boreoeutheria 35564
Pecora 23593
Aves 35582
Trypanosomatidae 187775
Leishmaniinae 249832
Microsporidia 28719
Chordata 42036
uncurated 362527
1|Flavi-like 30224
Ūsinis virus 61694
Wuhan mosquito virus 6 113519
Guadeloupe mosquito quaranja-like virus 1 26150
Culex flavivirus 39399
128|Tombus-like 879164
Culex pipiens-associated Tunisia virus 166870
1636|Partiti-like 614765
2|Rhabdo-like 93438
Culex iflavi-like virus 4 602584
24|Ifla-like 40486
25|Ifla-like 51675
296|Reo-like 197759
30|Rhabdo-like 21608
Hubei chryso-like virus 1 21597
Culex bunya-like virus 545058
Culex mosquito virus 4 196077
Marma virus 2611604
Culex mosquito virus 6 92258
Culex narnavirus 1 919082
Guadeloupe mosquito virus 913066
Goras virus 70643
Wenzhou sobemo-like virus 4 533167
Gordis virus 102248
63|Phenui-like 437625
Barstukas virus 1565