# Treemap

----------------

This Jupyter notebook walks through the steps of creating a treemap for displaying hierarchical information. In its current form it shows how reads from a metagenomic study are distributed across the different branches of the taxonomy.

*written by Gytis Dudas, 2019*

## Step 1: process and prepare archived NCBI taxonomy file (optional)

NCBI updated its virus taxonomy over the course of this project based on ICTV proposals, but the change is incomplete. For example, most sequences under `ssRNA negative-strand viruses` (a descriptive category) have been moved to `Negarnaviricota` (a genealogical category), but >100 accessions that are distinctly negative-sense RNA viruses still remain under the old category after the move. For the sake of consistency we have decided to go with the older version of the taxonomy.

This cell downloads the required version of the taxonomy, unzips it, and repackages it as a gzipped tarball (`.tar.gz`), which is the only format `ete3` will take.
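
The repackaging step can also be sketched in Python using only the standard library. This is a minimal sketch and not the notebook's own method (that is the bash cell below); the function name `zip_to_targz` and its arguments are hypothetical, and the download itself is left out.

```python
import os
import tarfile
import tempfile
import zipfile

def zip_to_targz(zip_path, targz_path):
    """Repackage the members of a .zip archive into a gzipped tarball."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)  # unpack into a scratch directory
        with tarfile.open(targz_path, 'w:gz') as tar:
            for fname in sorted(os.listdir(tmp)):
                # arcname stores each file at the top level of the tarball
                tar.add(os.path.join(tmp, fname), arcname=fname)
    return targz_path
```

Calling `zip_to_targz('taxdmp_2019-12-01.zip', 'taxdmp_2019-12-01.tar.gz')` in the download folder would then produce the file that `taxonomy_path` points to in Step 2, assuming the zip has already been downloaded.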

In [1]:
# %%bash

# store_folder=/Users/evogytis/Downloads
# cd $store_folder

# # tax_db="taxdmp_2019-01-01" ## latest taxonomy release that still contains original virus taxonomy
# tax_db="taxdmp_2019-12-01" ## latest taxonomy release
# curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/$tax_db.zip ## download taxonomy file

# rm -rf $store_folder/$tax_db ## remove existing directory
# mkdir $store_folder/$tax_db

# unzip -o $store_folder/$tax_db.zip -d $store_folder/$tax_db

# cd $store_folder/$tax_db; tar -czvf $store_folder/$tax_db.tar.gz *.dmp *.prt *.txt ## archive the dump, genetic code and readme files

## Step 2: load libraries, update ete3's taxonomy to file provided

This cell loads three standard-library modules (`os`, `json`, and `glob`) and `ete3`. `ete3` is required to place all BLAST hits in the treemap via its `get_lineage` method. Updating the taxonomy database takes ~3 minutes.

Also loaded is a table of contig calls (`contig_calls_decontam.tsv`) that records, among other columns, the number of reads mapped to each contig in each sample.
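
The nested `sample -> contig -> columns` dictionary built in this cell can be illustrated on a toy table. The sketch below is an assumption-laden alternative using `csv.DictReader` rather than the manual header dictionary used in the cell; the function name `load_contig_table` and the rows are made up, and only the column names `sample`, `contig_name` and `read_count` come from the notebook.

```python
import csv
import io

def load_contig_table(handle):
    """Return {sample: {contig_name: row_dict}} from a tab-separated table."""
    table = {}
    for row in csv.DictReader(handle, delimiter='\t'):
        table.setdefault(row['sample'], {})[row['contig_name']] = row
    return table

# Made-up rows, for illustration only
example = io.StringIO(
    'sample\tcontig_name\tread_count\n'
    'S1\tc1\t10.0\n'
    'S1\tc2\t3.0\n'
    'S2\tc1\t7.0\n'
)
contigs = load_contig_table(example)

# Values are kept as strings, so counts are converted on use
reads = int(float(contigs['S1']['c1']['read_count']))
```

As in the cell, every value stays a string until it is needed, which is why the later steps wrap fields in `int(float(...))`.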

In [2]:
import ete3
import os,json,glob

# taxonomy_path='/Users/evogytis/Downloads/taxdmp_2019-01-01.tar.gz'
taxonomy_path='/Users/evogytis/Downloads/taxdmp_2019-12-01.tar.gz'
base_path='/Users/evogytis/Documents/manuscripts/skeeters/data' ## point to where the data folder to the repo is locally


ncbi=ete3.ncbi_taxonomy.NCBITaxa()
# ncbi.update_taxonomy_database(taxdump_file=taxonomy_path) ## uncomment to update ete3's taxonomy

#########
## Load read counts of each contig
#########
contig_info={}

with open(os.path.join(base_path,'s3/contig_quality_concat/contig_calls_decontam.tsv'),'r') as contig_file:
    for line in contig_file: ## iterate over lines
        l=line.strip('\n').split('\t')
        if l[0]=='sample':
            header={x:i for i,x in enumerate(l)} ## create header dict (column name -> column index)
        else:
            sample=l[header['sample']] ## get sample name
            contig_name=l[header['contig_name']] ## get contig name

            if sample not in contig_info:
                contig_info[sample]={}

            contig_info[sample][contig_name]={x:l[header[x]] for x in header} ## keep every column (as strings)


## Step 3: load pre-defined taxa that will be displayed in the treemap
-----

The following cell loads a file (`displayed_taxa_reads.json`) that looks like this:

```
[{"taxid": 1,
"taxonomy": "root"},
{"taxid": 10239,
"taxonomy": "Viruses"},
{"taxid": 131567,
"taxonomy": "cellular organisms"},
{"taxid": 2157,
"taxonomy": "Archaea"},
{"taxid": 2759,
"taxonomy": "Eukaryota"},
{"taxid": 2,
"taxonomy": "Bacteria"}]
```

It is loaded as a flat list of branches, which get annotated later (here with contig and read counts) and in the last steps built into a nested tree data structure (_i.e._ here it would be `(Root(Viruses,CellularOrganisms(Eukaryota,Bacteria,Archaea)));`).

By default the treemap only contains the taxids listed in this file. Any more specific taxon, _e.g._ _Metazoa_, is traversed back towards the root until a taxid that does exist in the file is found, which here would be _Eukaryota_.
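
The traversal described above can be sketched without `ete3` by walking a root-first lineage backwards. This is only an illustration: `nearest_displayed` is a hypothetical helper, and the lineage is a hand-typed stand-in for what `ncbi.get_lineage` would return for _Metazoa_.

```python
def nearest_displayed(lineage, displayed):
    """Return the most specific taxid in `lineage` (root first) that is displayed."""
    for taxid in reversed(lineage):  # start at the hit, head towards the root
        if taxid in displayed:
            return taxid
    return None  # nothing along the lineage is displayed

# Taxids from the example file above
displayed = {1, 10239, 131567, 2157, 2759, 2}

# Hand-typed stand-in for a Metazoa lineage (assumption, not queried from ete3)
metazoa_lineage = [1, 131567, 2759, 33154, 33208]

compartment = nearest_displayed(metazoa_lineage, displayed)
```

Here `compartment` ends up as `2759` (_Eukaryota_), because neither of the two more specific taxids in the lineage is listed in the file.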

At the end of the cell additional branches are added to the tree. These are higher-order taxonomic lineages that are currently not listed by NCBI (_e.g._ the Narna-Levi supergroup that links _Narnaviridae_ and _Leviviridae_). They are in turn linked into an even higher-order structure based on strandedness (_i.e._ Baltimore class), which can be paraphyletic but is used here to introduce more structure into the treemap.
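
One way to picture these extra branches is as a rewrite of each lineage, in which a family's higher-order group is inserted directly in front of it, and that group's own parent in front of the group. The notebook does the actual rewriting in Step 5; the sketch below is a simplified, hypothetical version (`insert_groups` is not a notebook function) applied to a shortened, hand-typed lineage for _Leviviridae_ (taxid 11989).

```python
def insert_groups(lineage, inserts):
    """Insert each rank's higher-order parent directly before it, repeating until stable."""
    adjusted = list(lineage)  # work on a copy
    changed = True
    while changed:
        changed = False
        for i, rank in enumerate(adjusted):
            parent = inserts.get(rank)
            if parent is not None and parent not in adjusted:
                adjusted.insert(i, parent)  # parent goes in front of its child
                changed = True
                break  # restart the scan, the list has just been modified
    return adjusted

# Two entries borrowed from the notebook's high_order_insert dictionary
inserts = {11989: 'Narna-Levi', 'Narna-Levi': '(+)ssRNA viruses'}

# Shortened lineage (assumption): root, Viruses, Leviviridae
levi = insert_groups([1, 10239, 11989], inserts)
```

The result is `[1, 10239, '(+)ssRNA viruses', 'Narna-Levi', 11989]`: two made-up ranks now sit between _Viruses_ and the family, and each becomes its own treemap compartment.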

In [3]:
J=json.load(open(os.path.join(base_path,'../treemap/displayed_taxa_reads.json'),'r')) ## load designated treemap branches
branches={b['taxid']:b for b in J} ## flat list of branches indexed by taxid

remove_branches=[] ## empty list that will contain taxids whose lineage cannot be recovered by ete3 (because of out-dated taxonomy)
for taxid in branches: ## iterate over branches loaded so far
    try:
        ncbi.get_lineage(taxid) ## attempt to get lineage
    except ValueError: ## attempt failed
        remove_branches.append(taxid) ## remember taxid for removal later
        print('taxid %s not in taxdump.tar.gz file loaded earlier, it will be excluded'%(taxid))
        
for taxid in remove_branches:
    branches.pop(taxid) ## remove taxids that failed
        
branches['uncurated']={'taxonomy':'uncurated','taxid':'uncurated','attrs':{'colour': '#E7E7E6'}} ## also create a branch that could contain no hits (i.e. total number of queries minus number of queries that hit something)

for b in branches: ## assign default colour to branches
    branches[b]['attrs']={'colour':'#E7E7E6'} ## default colour is light grey, but a later cell assigns a colour based on descent from Bacteria, Eukaryota or Viruses
    
print('current taxids that exist as branches: %s\n'%(branches.keys()))

###########
## This code will add additional treemap compartments that group viral families into higher-order structures which don't exist in official NCBI taxonomy
###########
high_order_insert={2501952: 'Mononega-Chu', ## Chu
                   11157: 'Mononega-Chu', ## Mononega
                   249184: 'Hepe-Virga', ## Tymo
                   675071: 'Hepe-Virga', ## Virga
                   2560063: 'Narna-Levi', ## Botourmia
                   186766: 'Narna-Levi', ## Narna
                   11989: 'Narna-Levi', ## Levi
                   119163: 'Luteo-Sobemo', ## Luteo
                   2169577: 'Luteo-Sobemo', ## Solemo
                   39738: 'Tombus-Noda', ## Tombus
                   11006: 'Toti-Chryso', ## Toti
                   249310: 'Toti-Chryso', ## Chryso
                   11050: '(+)ssRNA viruses', ## Flavi
                   464095: '(+)ssRNA viruses', ## Picorna
                   10880: 'dsRNA viruses', ## Reo
                   11012: 'dsRNA viruses', ## Partiti
                   'Mononega-Chu': 2497569, ## comes back to (-)ssRNA viruses which exists as a taxid
                   'Toti-Chryso': 'dsRNA viruses', 
                   'Luteo-Sobemo': '(+)ssRNA viruses', 
                   'Tombus-Noda': '(+)ssRNA viruses', 
                   'Narna-Levi': '(+)ssRNA viruses', 
                   'Hepe-Virga': '(+)ssRNA viruses'} ## this dictionary will be used to check taxids for insertion of a higher level "taxid"

for insert in list(high_order_insert.keys())+list(high_order_insert.values()): ## combine all redirects and their destinations
    try:
        taxonomy=ncbi.get_taxid_translator([insert])[insert] ## get actual name if taxid
    except Exception: ## not a taxid known to ete3 (e.g. a made-up group name), use the key itself as the name
        taxonomy=insert
        
    if insert not in branches: ## insert branch if it hasn't been done in earlier loops
        branches[insert]={'attrs':{'colour': '#E7E7E6'},'taxid': insert, 'taxonomy': taxonomy}


current taxids that exist as branches: dict_keys([1, 131567, 2, 953, 1236, 28211, 203691, 91347, 135619, 1783272, 4751, 2759, 1437010, 35500, 33554, 9989, 9979, 8782, 9126, 5654, 1286322, 5690, 1206794, 33213, 6029, 451864, 6231, 33090, 33154, 7711, 10239, 2497569, 11157, 2501952, 11308, 1980410, 1980417, 1980418, 1980416, 1299308, 11270, 2501985, 186766, 11989, 2560063, 2169577, 119163, 11012, 11050, 39738, 10880, 11006, 249310, 464095, 699189, 232795, 249184, 675071, 'uncurated'])



## Step 4: annotation of branches
------

In this cell we add the information about how many reads (and contigs) belong to each taxonomic compartment. The `summarise` option allows you to choose between relying on the treemap compartments specified in the `displayed_taxa_reads.json` file or creating a new treemap compartment for every new taxid seen in the BLAST results summary file.

If `backbone` is chosen, every taxid in the BLAST output summary is traversed back towards the root and all reads are assigned to the first treemap compartment encountered, _e.g._ if the BLAST hit was _Metazoa_, all the reads mapping to the contig would be assigned to _Eukaryota_, which is the most specific treemap compartment available along that lineage.

If `all` is chosen a contig that's been assigned to _Metazoa_ will create a new treemap compartment with all the reads from the contig assigned to it. Future contigs assigned to _Metazoa_ will contribute their reads to this compartment.
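
The difference between the two modes can be sketched on a toy set of compartments. This is a simplified illustration under stated assumptions: `assign_reads` is a hypothetical helper, the sample name and read count are made up, and the lineage is hand-typed rather than fetched from `ete3`. The handling of curated and uncurated viral contigs in the cell below is left out.

```python
def assign_reads(branches, lineage, sample, reads, summarise='backbone'):
    """Add `reads` for `sample` to a compartment chosen according to `summarise`."""
    taxid = lineage[-1]  # the hit itself is the last entry of a root-first lineage
    if summarise == 'all' and taxid not in branches:
        branches[taxid] = {'taxid': taxid, 'attrs': {}}  # new compartment for an unseen taxid
    for rank in reversed(lineage):  # walk from the hit towards the root
        if rank in branches:
            counts = branches[rank]['attrs'].setdefault(sample, {'read_count': 0})
            counts['read_count'] += reads
            return rank  # report which compartment received the reads
    return None

lineage = [1, 131567, 2759, 33154, 33208]  # hand-typed Metazoa lineage (assumption)
backbone = {t: {'taxid': t, 'attrs': {}} for t in (1, 131567, 2759)}
everything = {t: {'taxid': t, 'attrs': {}} for t in (1, 131567, 2759)}

hit_backbone = assign_reads(backbone, lineage, 'sample_A', 120, summarise='backbone')
hit_all = assign_reads(everything, lineage, 'sample_A', 120, summarise='all')
```

With `backbone` the 120 reads land in the existing _Eukaryota_ compartment (`2759`), while with `all` a new compartment for `33208` is created and receives them.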

In [4]:
summarise='backbone'
# summarise='all'

with open(os.path.join(base_path,'darkmatter/virus.json'),'r') as in_json:
    virus=json.load(in_json) ## curated viruses, looked up by RdRp group below

redirect={} ## will redirect some viruses from the taxids they were submitted under (usually under "unclassified Viruses") to something more informative

for sample in contig_info:
    for contig in contig_info[sample]:
        name=contig_info[sample][contig]['name'] ## get name (only exists for curated things)
        taxid=int(float(contig_info[sample][contig]['taxid'])) ## get taxid
        read_count=int(float(contig_info[sample][contig]['read_count'])) ## get read count
        
        if contig_info[sample][contig]['curated']=='True': ## curated virus
            pol_group=str(int(float(contig_info[sample][contig]['poly_group']))) ## get RdRp group
            correct_taxid=virus[pol_group]['taxid'] ## get curated taxid
            subm_taxid=virus[pol_group]['submission_taxid'] ## get taxid that will be used for submission
            
            if name=='TBD': ## new virus
                vir_id=contig_info[sample][contig]['provisional_name'] ## id is its provisional name
                redirect[vir_id]=correct_taxid ## add taxid for redirection later
                name=vir_id
                
            elif name!='': ## virus name not new, but check if taxid is good
                if correct_taxid!=subm_taxid: ## submission taxid won't be specific enough
                    redirect[subm_taxid]=correct_taxid ## add taxid for redirection later
                    vir_id=subm_taxid ## virus ID is its (incorrect/imprecise) taxid
                else: ## taxid is specific and correct, id is taxid
                    vir_id=taxid
            
            if vir_id not in branches:
                branches[vir_id]={'attrs':{'colour': '#E7E7E6'}} ## create a new branch for treemap
                branches[vir_id]['taxonomy']=name ## give it a name
                branches[vir_id]['taxid']=vir_id ## give whatever was closest to a taxid (provisional name for new viruses)

            if sample not in branches[vir_id]['attrs']:
                branches[vir_id]['attrs'][sample]={'read_count':0} ## sample not encountered yet
                
            branches[vir_id]['attrs'][sample]['read_count']+=read_count ## assign reads to virus
        
        else: ## not a curated contig
            try:
                lineage=ncbi.get_lineage(taxid) ## get lineage of taxid

                if summarise=='all': ## adding every taxid
                    if taxid not in branches:
                        branches[taxid]={'attrs':{'colour': '#E7E7E6'}} ## create a new branch for treemap
                        branches[taxid]['taxonomy']=ncbi.get_taxid_translator([taxid])[taxid] ## give it a name
                        branches[taxid]['taxid']=taxid ## assign taxid

                if 10239 in lineage: ## viral lineage
                    if sample not in branches['uncurated']['attrs']:
                        branches['uncurated']['attrs'][sample]={'read_count': 0}
                    branches['uncurated']['attrs'][sample]['read_count']+=read_count ## assign reads to uncurated part
                else: ## nonviral lineage
                    for rank in lineage[::-1]: ## iterate over lineage backwards (heading towards root)
                        if rank in branches: ## rank exists as a branch in treemap
                            branch=branches[rank] ## grab branch
                            if sample not in branch['attrs']: ## sample not encountered before
                                branch['attrs'][sample]={'read_count': 0}
                            branch['attrs'][sample]['read_count']+=read_count ## assign reads
                            break
            except ValueError:
                print('taxid %s not available'%(taxid))



## Step 5: build tree structure, output to file
-------

All the branches have been annotated by this stage and now it's time to build the tree data structure. We start by summarising the read counts across samples (and the number of samples they come from). We then specify colours for particular compartments and their descendants (_e.g._ -ssRNA viruses obviously need to be red-ish). The procedure involves fetching branch A, retrieving its lineage and looking along that lineage, starting next to A and heading towards the root, for the first taxid that exists as a treemap compartment (let's call it B). Treemap compartment A is then assigned as a child of treemap compartment B.
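
The nesting and colour inheritance can be sketched on a handful of branches. This is a minimal sketch, not the notebook's exact code: `build_tree` is a hypothetical helper, the lineages are shortened and hand-typed, and a branch with no displayed ancestor is simply placed at the top level.

```python
def build_tree(branches, lineages, colours, default='#E7E7E6'):
    """Nest flat branches into a tree; the most specific coloured rank in a lineage wins."""
    tree = {'children': []}
    # Shallow lineages first, so parents are handled before their children
    for taxid in sorted(branches, key=lambda t: len(lineages[t])):
        node, lineage = branches[taxid], lineages[taxid]
        node['attrs']['colour'] = default
        for rank in lineage:  # later (more specific) ranks overwrite earlier ones
            if rank in colours:
                node['attrs']['colour'] = colours[rank]
        ancestors = [r for r in lineage[:-1] if r in branches]
        parent = branches[ancestors[-1]] if ancestors else tree  # nearest displayed ancestor
        parent.setdefault('children', []).append(node)
    return tree

branches = {t: {'taxid': t, 'attrs': {}} for t in (1, 10239, 2759, 11050)}
lineages = {1: [1],
            10239: [1, 10239],
            2759: [1, 131567, 2759],
            11050: [1, 10239, 11050]}  # shortened lineage for Flaviviridae (assumption)
colours = {10239: 'grey', 11050: '#0074C4'}

tree = build_tree(branches, lineages, colours)
```

In the result, `11050` sits under `10239` with its own colour, `10239` and `2759` both sit under the root (`131567` is skipped because it is not a compartment here), and `2759` keeps the default colour because nothing along its lineage has one assigned.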

The resulting tree structure is saved to file as JSON and can be inspected using the JavaScript code included in the repository: run a local server with the command below and open `treemap.html`:
```
python3 -m http.server 4000
```

In [5]:
taxa_tree={'children':[],'attrs':{'colour':'#E7E7E6'}} ## tree structure (+ add no-hit branch from the beginning)

sorted_taxids=sorted(branches,key=lambda k: len(ncbi.get_lineage(k)) if isinstance(k,int) else 999)

for b in sorted_taxids: ## iterate through flat list of branches
    reads=[branches[b]['attrs'][c]['read_count'] for c in branches[b]['attrs'] if 'CMS' in c]
#     contigs=[branches[b]['attrs'][c]['contig_count'] for c in branches[b]['attrs'] if 'CMS' in c]
    branches[b]['attrs']['read_count']=sum(reads)
#     branches[b]['attrs']['contig_count']=sum(contigs) ## compute sum of contig counts across samples  
    branches[b]['attrs']['sample_count']=len(reads)

## Colours assigned to a treemap compartment (all descendants inherit parental colour unless they have their own assigned colour)
lineage_colours={'dsRNA viruses': 'grey', ## dsRNA
                 '(+)ssRNA viruses': 'grey',  ## (+)ssRNA
                 2497569: 'grey', ## (-)ssRNA
                 10239: 'grey', ## viruses
                 'Mononega-Chu': '#FF7D48', ## Mononegas+Chus
#                  11157: '#FF7D48', ## Mononegas
#                  2501952: '#9E415F', ## Chu
                 1980410: '#AC6569', ## Bunyas
                 11308: '#C43A3F', ## Orthomyxos
                 11050: '#0074C4', ## Flavi 
                 464095: '#42C754', ## Picorna
                 11012: '#965DAE', ## Partiti
                 10880: '#866E8B', ## Reo
                 'Hepe-Virga': '#AFD1FF', ## Hepe-Virga
                 'Narna-Levi': '#6967BC', ## Narna-Levi
                 'Luteo-Sobemo': '#E6C930', ## Luteo-Sobemo
                 'Toti-Chryso': '#D8A5CE', ## Toti-Chryso
                 'Tombus-Noda': '#618F80', ## Tombus-Noda
                 'uncurated': '#ACADAE', ## uncurated viral contigs
                 2: '#867A5F', ## Bacteria
                 953: '#86665F', ## Wolbachia
                 2759: '#E5E4E2', ## Eukaryota
                 5794:'#5E716A', ## apicomplexa
                 7711: '#B2BEB5', ## chordata
                 8782: '#6082B6', ## Aves
                 9347: '#658EA9', ## Eutheria
                 33090: '#2e8b57', ## Viridiplantae
                 4751: '#808080', ## Fungi
                 5654: '#708090' ## Trypanosomatidae
                }

taxid_lineages={} ## will contain new lineages
for taxid in sorted_taxids: ## iterate over taxids
    if taxid in redirect: ## taxid has been redirected because the official taxid is imprecise (or this will create a new compartment for completely new things)
        lineage=ncbi.get_lineage(redirect[taxid])+[taxid]
    elif taxid=='uncurated': ## uncurated viral hit (not associated with RdRp or inexplicably fragmented viral sequences)
        lineage=[1,10239,'uncurated']
    elif taxid in high_order_insert or taxid in high_order_insert.values(): ## taxid will have more branches inserted because of higher level redirection
        lineage=[1,10239,taxid]
    else:
        lineage=ncbi.get_lineage(taxid) ## taxid is correct as assigned
        
    adjusted_lineage=list(lineage) ## copy lineage
    
    continue_insertion=True ## assume inserting taxids is required
    while continue_insertion:
        left_to_insert=len([q for q in adjusted_lineage if q in high_order_insert and high_order_insert[q] not in adjusted_lineage]) ## count how many taxids in current lineage will need branches inserted before them
        
        for r,rank in enumerate(adjusted_lineage): ## iterate over ranks in lineage
            if rank in high_order_insert and high_order_insert[rank] not in adjusted_lineage: ## rank is set for insertion
                adjusted_lineage.insert(r,high_order_insert[rank]) ## insert new parental rank
                
        if left_to_insert==0: ## nothing left to insert, cease loop
            continue_insertion=False ## terminate
        
    taxid_lineages[taxid]=adjusted_lineage ## assign new, adjusted lineage

    
for taxid in sorted_taxids: ## iterate over every taxon in treemap
    taxon=branches[taxid] ## fetch branch
    lineage=taxid_lineages[taxid] ## get lineage for taxid
    
    for rank in lineage: ## iterate over (potentially) new lineage
        if rank in branches and rank in lineage_colours: ## lineage exists and has colour assignment
            taxon['attrs']['colour']=lineage_colours[rank] ## assign colour

    if len(lineage)==1 and taxon not in taxa_tree['children']: ## if root
        taxa_tree['children'].append(taxon) ## add root to tree

    for lin in lineage[::-1][1:]: ## iterate through lineage, starting from most recent, ignore first entry (self)
        if lin in branches: ## rank present amongst branches
            parent=branches[lin] ## grab parent
            
            if 'children' not in parent: ## if parent doesn't have children yet - add the attribute
                parent['children']=[]

            if taxon not in parent['children']: ## branch wasn't assigned to its parent yet
                parent['children'].append(taxon) ## add child to parent

            assert taxon['taxid']!=parent['taxid'], 'parent is child %s %s, lineage: %s'%(taxon['taxid'],taxon['taxonomy'],lineage)
            break ## break for loop, parent has been identified
    
with open(os.path.join(base_path,'../treemap/skeeters.json'),'w') as out_json:
    json.dump(taxa_tree,out_json,indent=1,sort_keys=True) ## write json out to repo

print('Done!')

Done!


In [6]:
def table_treemap(node,samples,stat,file=None):
    if 'taxid' in node:
        sample_reads=[node['attrs'][s][stat] if s in node['attrs'] else 0 for s in samples]
        if sum(sample_reads)>0:
            row='%s\t%s\t%s'%(node['taxid'],node['taxonomy'],'\t'.join(map(str,sample_reads)))
            if file is None:
                print(row)
            else:
                file.write('%s\n'%(row))
        
    if 'children' in node:
        for child in node['children']:
            table_treemap(child,samples,stat,file=file)

samples=sorted(list(set(sum([[attr for attr in branches[txid]['attrs'] if 'CMS' in attr] for txid in branches],[]))))

header='taxid\ttaxonomy\t%s\n'%('\t'.join(sorted(samples)))
table_out=open(os.path.join(base_path,'darkmatter/TableSX_treemap.tsv'),'w')
table_out.write(header)
# table_out=None

table_treemap(taxa_tree,samples,'read_count',table_out)

table_out.close()

print('Done!')

Done!


In [7]:
output_colours=open(os.path.join(base_path,'../figures/fig3/virus_color_scheme.tsv'),'w')

for pol_group in sorted(virus,key=lambda k: (virus[k]['family'],k)):
    family=virus[pol_group]['family']
    
    taxid=virus[pol_group]['taxid'] ## get assigned taxid to virus
    subm_taxid=virus[pol_group]['submission_taxid'] ## get the taxid under which the virus will be submitted (may belong to a group sitting under "unclassified Viruses")
    
    name=virus[pol_group]['provisional_name'] if 'provisional_name' in virus[pol_group] else virus[pol_group]['name'] ## get the official or provisional name of the virus (human-readable)
    
    if taxid!=subm_taxid: ## assigned taxid does not match submission taxid (previously described viruses under incorrect taxid in the database)
        vir_id=subm_taxid ## virus ID is the incorrect taxid
    elif 'provisional_name' in virus[pol_group]: ## new virus, still needs redirect
        vir_id=name
    else: ## taxid matches assigned taxid, not a new virus and sits at the appropriate taxid
        vir_id=taxid
    
    lineage=taxid_lineages[vir_id]
    
    for rank in lineage[::-1]:
        if rank in lineage_colours:
            output_colours.write('%s\t%s\t%s\n'%(pol_group,family,lineage_colours[rank]))
            break
            
output_colours.close()

In [8]:
for b in branches:
    reads=branches[b]['attrs']['read_count']
    if reads>20000:
        print(branches[b]['taxonomy'],reads)

root 215117
cellular organisms 338321
Bacteria 123177
Wolbachia 295760
Spirochaetes 106937
Enterobacterales 71438
Oceanospirillales 25934
Terrabacteria group 48446
Eukaryota 30950
Boreoeutheria 36940
Aves 38748
Trypanosomatidae 186353
Leishmaniinae 249890
Ecdysozoa 30871
Microsporidia 56145
Dikarya 20896
Viridiplantae 61664
Opisthokonta 25604
Chordata 27300
uncurated 374696
Hubei mosquito virus 4 879164
76|Phasma-like 205279
296|Reo-like 197759
Hubei virga-like virus 2 342734
Culex iflavi-like virus 4 602584
Marma virus 2612944
Culex bunyavirus 2 545058
Culex narnavirus 1 919082
1|Flavi-like 30224
Wuhan mosquito virus 6 113519
Gordis virus 102248
Culex mosquito virus 6 92258
1636|Partiti-like 614765
63|Phenui-like 437625
Culex flavivirus 39399
30|Rhabdo-like 21608
2|Rhabdo-like 93438
Culex mosquito virus 4 196077
24|Ifla-like 40486
Guadeloupe mosquito virus 913066
Aedes aegypti totivirus 56124
Guadeloupe mosquito quaranja-like virus 1 26150
25|Ifla-like 51675
Kellev virus 70643
Hubei c