## Step 1: process and prepare archived NCBI taxonomy file

NCBI updated its virus taxonomy over the course of this project based on ICTV proposals, but the change has been incomplete: most sequences under `ssRNA negative-strand viruses` (a descriptive category) have been moved to `Negarnaviricota` (a genealogical category), but >100 accessions remain under the old category. For the sake of consistency we use the older version of the taxonomy.

This cell downloads the required version of the taxonomy, unzips it, and repackages it as a gzipped tarball (`.tar.gz`), which is the only format `ete3` will take.
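
The repackaging step can also be sketched in Python, which avoids relying on shell glob patterns. This is a minimal sketch using only the standard `zipfile` and `tarfile` modules, not the method used in the cell below; the function name `repack_zip_as_targz` and its `suffixes` parameter are hypothetical.

```python
import os
import tarfile
import tempfile
import zipfile


def repack_zip_as_targz(zip_path, targz_path, suffixes=(".dmp", ".prt", ".txt")):
    """Extract a taxdump .zip and re-archive the selected files as a .tar.gz."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)  # unzip into a scratch directory
        with tarfile.open(targz_path, "w:gz") as tar:
            for name in sorted(os.listdir(tmp_dir)):
                if name.endswith(suffixes):  # keep only taxonomy files
                    # arcname stores files at the top level of the archive
                    tar.add(os.path.join(tmp_dir, name), arcname=name)
    return targz_path
```

Called with the paths used in this notebook, it would take the downloaded `.zip` as `zip_path` and write the `.tar.gz` that Step 2 points `ete3` at.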

In [1]:
%%bash

store_folder=/Users/evogytis/Downloads
cd $store_folder

tax_db="taxdmp_2019-01-01" ## latest taxonomy release that still contains original virus taxonomy
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/$tax_db.zip ## download taxonomy file

rm -rf $store_folder/$tax_db ## remove existing directory
mkdir $store_folder/$tax_db

unzip -o $store_folder/$tax_db.zip -d $store_folder/$tax_db

cd $store_folder/$tax_db; tar -czvf $store_folder/$tax_db.tar.gz *.dmp *.prt *.txt

Archive:  /Users/evogytis/Downloads/taxdmp_2019-01-01.zip
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/citations.dmp  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/delnodes.dmp  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/division.dmp  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/gencode.dmp  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/merged.dmp  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/names.dmp  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/nodes.dmp  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/gc.prt  
  inflating: /Users/evogytis/Downloads/taxdmp_2019-01-01/readme.txt  



## Step 2: load libraries, update ete3's taxonomy to file provided

This cell loads three standard-library modules (`os`, `json`, and `glob`) and `ete3`. `ete3` is required to place all BLAST hits in the treemap via the `get_lineage` method. Updating the taxonomy database takes ~3 minutes.
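
To make the shape of a lineage concrete, here is a toy stand-in for `get_lineage`. It is a sketch under assumptions: the `PARENT` map and the function `toy_get_lineage` are hypothetical and cover only six taxids, whereas `ete3` reads the full tree from the taxonomy dump. Like the real method as used in this notebook, it returns taxids ordered from root to the queried taxon and raises `ValueError` for unknown taxids.

```python
# Hypothetical child -> parent map for six taxids (root is its own parent).
PARENT = {1: 1, 131567: 1, 10239: 1, 2759: 131567, 2: 131567, 2157: 131567}


def toy_get_lineage(taxid, parent=PARENT):
    """Return the taxids from the root down to `taxid`."""
    if taxid not in parent:
        raise ValueError('%s taxid not found' % taxid)
    lineage = [taxid]
    while parent[lineage[-1]] != lineage[-1]:  # climb until the root is reached
        lineage.append(parent[lineage[-1]])
    return lineage[::-1]  # reverse so the root comes first


print(toy_get_lineage(2759))  # [1, 131567, 2759]
```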

In [2]:
import ete3
import os,json,glob

taxonomy_path='/Users/evogytis/Downloads/taxdmp_2019-01-01.tar.gz'

ncbi=ete3.ncbi_taxonomy.NCBITaxa()
ncbi.update_taxonomy_database(taxdump_file=taxonomy_path)

Loading node names...
2040807 names loaded.
204803 synonyms loaded.
Loading nodes...
2040807 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /Users/evogytis/.etetoolkit/taxa.sqlite ...
 2040000 generating entries...
Uploading to /Users/evogytis/.etetoolkit/taxa.sqlite

Inserting synonyms:      20000
Inserting taxid merges:      0
Inserting taxids:       20000
Inserting taxids:       2040000

## Step 3: load pre-defined taxa that will be displayed in the treemap
-----

The following cell loads a file (`displayed_taxa_reads.json`) that looks like this:

```
[{"taxid": 1,
"taxonomy": "root"},
{"taxid": 10239,
"taxonomy": "Viruses"},
{"taxid": 131567,
"taxonomy": "cellular organisms"},
{"taxid": 2157,
"taxonomy": "Archaea"},
{"taxid": 2759,
"taxonomy": "Eukaryota"},
{"taxid": 2,
"taxonomy": "Bacteria"}]
```

It is loaded as a flat list of branches, which are annotated (here with contig and read counts) and, in the last step, built into a nested tree data structure (_i.e._ here it would be `(Root((CellularOrganisms(Eukaryota,Bacteria,Archaea)),Viruses))`).

By default this Jupyter notebook will produce a treemap displaying only these pre-specified taxonomic compartments. There is an option to include all hits, which ends up being a bit crowded when rendered statically, but is ideal for exploration when rendered interactively in JavaScript. This step also removes any taxids that are not present in the taxonomy database loaded at the beginning.
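
The flat-to-nested conversion can be sketched as follows. This is an illustration under assumptions, not the notebook's code: `LINEAGES` is a hand-written stand-in for `ncbi.get_lineage()`, and the helper names `nest` and `outline` are hypothetical. Taxid `2497569` is included in the flat list to show how a taxid missing from the taxonomy gets dropped.

```python
# Stand-in for ncbi.get_lineage(): root-to-tip taxid lists for the six taxa above.
LINEAGES = {1: [1], 10239: [1, 10239], 131567: [1, 131567],
            2157: [1, 131567, 2157], 2759: [1, 131567, 2759], 2: [1, 131567, 2]}

flat = [{'taxid': 1, 'taxonomy': 'root'},
        {'taxid': 10239, 'taxonomy': 'Viruses'},
        {'taxid': 131567, 'taxonomy': 'cellular organisms'},
        {'taxid': 2157, 'taxonomy': 'Archaea'},
        {'taxid': 2759, 'taxonomy': 'Eukaryota'},
        {'taxid': 2, 'taxonomy': 'Bacteria'},
        {'taxid': 2497569, 'taxonomy': 'Negarnaviricota'}]  # not in LINEAGES


def nest(flat_branches, lineages):
    """Attach each branch to its nearest ancestor that is also a branch."""
    # Drop branches whose taxid has no lineage, then index the rest by taxid.
    branches = {b['taxid']: dict(b) for b in flat_branches if b['taxid'] in lineages}
    root = None
    for taxid in sorted(branches, key=lambda t: len(lineages[t])):
        ancestors = lineages[taxid][:-1][::-1]  # nearest ancestor first
        parent = next((a for a in ancestors if a in branches), None)
        if parent is None:
            root = branches[taxid]  # no ancestor among branches: this is the root
        else:
            branches[parent].setdefault('children', []).append(branches[taxid])
    return root


def outline(node):
    """Render a nested branch as name(child,child,...)."""
    kids = node.get('children', [])
    if not kids:
        return node['taxonomy']
    return '%s(%s)' % (node['taxonomy'], ','.join(outline(k) for k in kids))


print(outline(nest(flat, LINEAGES)))
# root(Viruses,cellular organisms(Archaea,Eukaryota,Bacteria))
```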

In [3]:
treemap_path='/Users/evogytis/Documents/manuscripts/skeeters/treemap/'

with open(os.path.join(treemap_path,'displayed_taxa_reads.json'),'r') as handle:
    J=json.load(handle) ## load designated treemap branches
branches={b['taxid']:b for b in J} ## flat list of branches indexed by taxid

remove_branches=[] ## empty list that will contain taxids whose lineage cannot be recovered by ete3 (because of outdated taxonomy)
for taxid in branches: ## iterate over branches loaded so far
    try:
        ncbi.get_lineage(taxid) ## attempt to get lineage
    except ValueError: ## attempt failed
        remove_branches.append(taxid) ## remember taxid for removal later
        print('taxid %s not in taxdump.tar.gz file loaded earlier'%(taxid))
        
for taxid in remove_branches:
    branches.pop(taxid) ## remove taxids that failed
        
branches['no_hit']={'taxonomy':'no blast hit','taxid':'no_hit'} ## also create a branch that could contain no hits (i.e. total number of queries minus number of queries that hit something)

for b in branches: ## assign default colour to branches
    branches[b]['attrs']={'colour':'slategrey'} ## default colour is slategrey, but later cell assigns a colour based on descent from Bacteria, Eukaryotes or Viruses

## Step 4: annotation of branches
------

In this cell we load a mapping of contig IDs to the number of reads that map to each contig. We then iterate over BLAST output files (against NT and NR) that have been annotated with the taxid each contig could have come from, including uncertainty (_i.e._ higher taxonomic ranks if the contig BLASTs to many taxa). We are only interested in NR hits for viruses and NT hits for non-virus taxa.
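
As a small illustration of the first part, the read-count files are assumed to be two tab-separated columns (contig ID, read count), which is how the cell below reads them. The helper name `parse_counts`, the sample name, and the contig IDs here are hypothetical.

```python
import io


def parse_counts(handle):
    """Parse 'contig_id<TAB>read_count' lines into a {contig_id: int} dict."""
    counts = {}
    for line in handle:
        line = line.rstrip('\n')
        if not line:  # skip blank lines
            continue
        contig_id, read_count = line.split('\t')
        counts[contig_id] = int(read_count)
    return counts


# In-memory stand-in for one sample's counts file.
example = io.StringIO('NODE_1_length_500_cov_2.0\t12\nNODE_2_length_300_cov_1.0\t3\n')
contig_reads = {'sample_A': parse_counts(example)}
print(contig_reads['sample_A']['NODE_1_length_500_cov_2.0'])  # 12
```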

Additionally we redirect a number of taxids that do not exist in the NCBI taxonomy version we are using here to the nearest taxa that do exist. We keep track of how many we lose along the way and also catch cases where a contig has both a non-virus NT hit and a virus NR hit, which would overwrite the previous entry. In those cases we keep the taxid that has the higher bitscore.
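
The selection rules described above can be condensed into a short sketch. It is a simplification under assumptions: `LINEAGES` holds abbreviated stand-in lineages rather than the output of `ncbi.get_lineage()`, `REDIRECTS` is a subset of the redirects used in the cell below, the contig names are shortened, and the function names are hypothetical.

```python
VIRUSES, CELLULAR = 10239, 131567

# Abbreviated stand-in lineages for two taxids that appear in the output below.
LINEAGES = {1566308: [1, 10239, 1566308],      # unclassified Quaranjavirus
            3635: [1, 131567, 2759, 3635]}     # Gossypium hirsutum
REDIRECTS = {2559587: 10239, 2585030: 10239}   # missing taxids -> existing taxa


def is_valid(lineage, blast_type):
    """Keep NT hits to cellular organisms and NR hits to viruses only."""
    return ((CELLULAR in lineage and blast_type == 'nt') or
            (VIRUSES in lineage and blast_type == 'nr'))


def assign_taxids(hits, lineages, redirects):
    """Return ({contig: (taxid, bitscore)}, [taxids without a lineage])."""
    best, lost = {}, []
    for contig_id, blast_type, bitscore, taxid in hits:
        taxid = redirects.get(taxid, taxid)
        if taxid not in lineages:  # taxid absent from this taxonomy version
            lost.append(taxid)
            continue
        if not is_valid(lineages[taxid], blast_type):
            continue
        if contig_id in best and bitscore < best[contig_id][1]:
            continue  # the previously stored hit has the better bitscore
        best[contig_id] = (taxid, bitscore)
    return best, lost


hits = [('NODE_3', 'nr', 863.2, 1566308),
        ('NODE_3', 'nt', 52.8, 3635),
        ('NODE_7', 'nr', 51.2, 2511984)]  # taxid with no stand-in lineage
best, lost = assign_taxids(hits, LINEAGES, REDIRECTS)
print(best)  # {'NODE_3': (1566308, 863.2)}
print(lost)  # [2511984]
```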

In [4]:
contig_reads={} ## will contain a mapping of sample: contig ID: number of reads

base_path='/Users/evogytis/Dropbox/Jupyter_notebooks/Biohub/California_mosquitoes/s3_bucket/contig_quality/' ## where sample folders sit

for folder in glob.glob(os.path.join(base_path,'*/bowtie_csp_counts_1000.txt')): ## iterate over remapping of reads to contigs
    sample=folder.split(os.path.sep)[-2] ## get sample name
    for line in open(folder,'r'): ## iterate over contig IDs
        l=line.strip('\n').split('\t')
        contig_id,read_count=l ## get contig ID, reads that mapped to it
        read_count=int(read_count) ## read count is integer
        
        if sample not in contig_reads:
            contig_reads[sample]={} ## dict hasn't seen this sample before
        contig_reads[sample][contig_id]=read_count ## map number of reads to contig ID in appropriate sample
        
contig2taxid={} ## maps each contig (per sample) to a taxid
contig_stats={} ## tracks stats of each contig (per sample)
lost=[] ## keep track of taxids that can't be found because of outdated database
lost_contigs={}

for folder in glob.glob(os.path.join(base_path,'*/blast_lca_??_filtered.m9')): ## iterate over blast summaries
    bname=folder.split(os.path.sep)[-2] ## get sample name
    
    for line in open(folder): ## iterate over blast summary
        l=line.strip('\n').split('\t')
        
        if l[0]=='query':
            header={x:i for i,x in enumerate(l)} ## header dict
            
        else: ## not header
            contig_id=l[0] ## get contig ID
            blast_type=l[1] ## nr or nt
            sample=l[2] ## sample name
            taxid=int(l[-1]) ## assigned taxid
            
            assert sample==bname,'Sample names do not match: %s %s %s'%(sample,bname,l) ## check that sample names match
            
            if taxid in [2559587,2585030]: ## redirect taxa that do not exist in the NCBI taxonomy version used here (only going after taxids that end up being hit frequently)
                taxid=10239
            elif taxid==2507131:
                taxid=2169577
            elif taxid==2511993:
                taxid=354276
            elif taxid==2509675:
                taxid=388448
            elif taxid==2500547:
                taxid=1778601
            
            if sample not in contig2taxid: ## sample not seen in contig2taxid
                contig2taxid[sample]={}
                contig_stats[sample]={}
                
            try:
                lineage=ncbi.get_lineage(taxid) ## attempt to get lineage of taxid (throws error if it's not in the current taxonomy version)
                proceed=False ## assume invalid combination (virus + nt blast, cellular organism + nr blast)
                
                if (131567 in lineage and blast_type=='nt') or (10239 in lineage and blast_type=='nr'): ## cellular organisms and nt OR virus and nr
                    proceed=True ## good to go
                    
                if proceed: ## good to go
                    write=True ## assume entry can be stored in a dict
                    
                    if contig_id in contig2taxid[sample]: ## contig has been seen and assigned to a taxid previously
                        if float(l[header['bitscore']])<float(contig_stats[sample][contig_id]['bitscore']): ## original bitscore better
                            write=False ## original bitscore was better, don't write this taxid to the dict this loop
                        print('\nContig %s in sample %s has been assigned a taxid previously.'%(contig_id,sample))
                        old_taxid=contig2taxid[sample][contig_id] ## old taxid stored previously
                        new_taxid=taxid ## current taxid
                        translate=ncbi.get_taxid_translator([old_taxid,new_taxid]) ## get dict of taxid names
                        old_name,new_name=translate[old_taxid],translate[new_taxid] ## get taxonomic names
                        old_bitscore=float(contig_stats[sample][contig_id]['bitscore']) ## get stored bitscore
                        new_bitscore=float(l[header['bitscore']]) ## get current bitscore
                        print('%s\noriginal taxid was %s (%s) with bitscore %.2f\nnew taxid is %s (%s) with bitscore %.2f'%('keeping original' if write==False else 'keeping new entry',old_taxid, old_name, old_bitscore, new_taxid, new_name, new_bitscore))
                        
                    if write==True: ## allowed to write this iteration (contig not seen before or new taxid has better bitscore)
                        contig2taxid[sample][contig_id]=taxid ## map contig (within sample) to a taxid
                        contig_stats[sample][contig_id]={x:l[header[x]] for x in header} ## store all of stats for this contig
                
            except ValueError: ## fetching lineage failed
                print('taxid %s not present in taxonomy database used'%(taxid))
                if sample not in lost_contigs: ## sample not seen before amongst lost contigs
                    lost_contigs[sample]=[]
                lost_contigs[sample].append(l)
                lost.append(taxid) ## keep track of taxids that failed lineage fetching


area_values={'contigs':{},'reads':{}} ## will contain counts of contigs and reads

for sample in contig2taxid: ## iterate over samples in mapping of contig IDs to taxids
    if sample not in area_values['contigs']: ## sample not seen before
        area_values['contigs'][sample]={}
        area_values['reads'][sample]={}
        
    for contig_id in contig2taxid[sample]: ## iterate over contigs
        taxid=contig2taxid[sample][contig_id] ## get taxid assigned to contig
        if taxid not in area_values['contigs'][sample]: ## taxid not encountered before
            area_values['contigs'][sample][taxid]=0
            area_values['reads'][sample][taxid]=0
        
        area_values['contigs'][sample][taxid]+=1 ## increment contig count for this taxid
        
        if contig_id not in contig_reads[sample]: ## check if this contig has any number of reads assigned to it (remapping was done on the basis of NT, so if this contig only had a hit in NR there won't be reads mapping to it)
            print('No reads mapping to %s from %s were found'%(contig_id,sample))
        else:
#             if contig_reads[sample][contig_id]>2: ## optional filtering step
            area_values['reads'][sample][taxid]+=contig_reads[sample][contig_id] ## increment number of reads assigned to taxid
        
print('\n\nTaxids without assigned lineage: %s (%s unique)'%(len(lost),len(set(lost)))) ## report back on taxids lost along the way
print('Number of times a taxid was not assigned:',{l:lost.count(l) for l in set(lost)})


Contig NODE_3_length_2417_cov_14.070085 in sample CMS002_026d_Rb_S149_L004 has been assigned a taxid previously.
keeping original
original taxid was 1566308 (unclassified Quaranjavirus) with bitscore 863.20
new taxid is 3635 (Gossypium hirsutum) with bitscore 52.80

Contig NODE_4_length_2425_cov_7.017888 in sample CMS002_045f_Rb_S189_L004 has been assigned a taxid previously.
keeping original
original taxid was 1566308 (unclassified Quaranjavirus) with bitscore 863.20
new taxid is 3635 (Gossypium hirsutum) with bitscore 52.80
taxid 2545764 not present in taxonomy database used

Contig NODE_4_length_2435_cov_10.370653 in sample CMS002_045d_Rb_S186_L004 has been assigned a taxid previously.
keeping original
original taxid was 1566308 (unclassified Quaranjavirus) with bitscore 863.20
new taxid is 3635 (Gossypium hirsutum) with bitscore 52.80

Contig NODE_2_length_2442_cov_5.123890 in sample CMS002_026a_Rb_S146_L004 has been assigned a taxid previously.
keeping original
original taxid was


Contig NODE_55_length_1488_cov_1.956768 in sample CMS001_017_Ra_S6 has been assigned a taxid previously.
keeping original
original taxid was 1980610 (Phasi Charoen-like phasivirus) with bitscore 109.80
new taxid is 45351 (Nematostella vectensis) with bitscore 52.80

Contig NODE_9_length_3219_cov_5.589752 in sample CMS001_035_Ra_S20 has been assigned a taxid previously.
keeping original
original taxid was 699180 (Dinovernavirus) with bitscore 602.10
new taxid is 55661 (Cuculus canorus) with bitscore 52.80
taxid 2511984 not present in taxonomy database used

Contig NODE_2_length_3227_cov_97.069841 in sample CMS001_027_Ra_S16 has been assigned a taxid previously.
keeping original
original taxid was 1923660 (Wenzhou sobemo-like virus 4) with bitscore 784.60
new taxid is 9126 (Passeriformes) with bitscore 141.00

Contig NODE_4_length_2440_cov_83.041050 in sample CMS002_045b_Rb_S184_L004 has been assigned a taxid previously.
keeping original
original taxid was 1566308 (unclassified Quaranja

No reads mapping to NODE_13788_length_245_cov_0.869048 from CMS001_025_Ra_S7 were found
No reads mapping to NODE_14375_length_241_cov_0.890244 from CMS001_025_Ra_S7 were found
No reads mapping to NODE_15021_length_237_cov_0.912500 from CMS001_025_Ra_S7 were found
No reads mapping to NODE_15307_length_235_cov_0.924051 from CMS001_025_Ra_S7 were found
No reads mapping to NODE_15508_length_233_cov_1.775641 from CMS001_025_Ra_S7 were found
No reads mapping to NODE_5512_length_373_cov_71.506757 from CMS001_025_Ra_S7 were found
No reads mapping to NODE_19_length_276_cov_1.100503 from CMS002_0Water3_Rb_S151_L004 were found
No reads mapping to NODE_334_length_279_cov_0.722772 from CMS001_004_Ra_S2 were found
No reads mapping to NODE_337_length_278_cov_1.084577 from CMS001_004_Ra_S2 were found
No reads mapping to NODE_379_length_272_cov_0.748718 from CMS001_004_Ra_S2 were found
No reads mapping to NODE_455_length_264_cov_0.780749 from CMS001_004_Ra_S2 were found
No reads mapping to NODE_469_len

## Step 5: transfer information to treemap branches

This cell transfers the information collected from files in the previous step to branches of the treemap. This is done by fetching the lineage of each BLAST hit (each of which has an assigned taxid) and traversing backwards through that lineage until a rank is found that exists within the treemap.

At this point a choice should be made about what kind of treemap summary data structure will be generated. The `backbone` option uses the taxonomic ranks provided in the `displayed_taxa_reads.json` file as the final compartments which will be annotated. For example, if the backbone is specified as "Eukaryotes" and "Bacteria" and a BLAST hit is assigned to "Homo sapiens", then the contig and read counts will be assigned to the "Eukaryotes" compartment. If the `all` option is chosen, the "Homo sapiens" hit will be created as a new compartment within "Eukaryotes", and contig and read counts assigned to the new compartment instead of its parent.
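
The difference between the two modes can be sketched with the "Homo sapiens" example. This is a minimal sketch under assumptions: the lineage is abbreviated by hand rather than fetched with `ncbi.get_lineage()`, and the function name `place_hit` is hypothetical.

```python
def place_hit(taxid, lineage, branches, summarise='backbone'):
    """Return the taxid of the treemap compartment that receives this hit's counts."""
    if summarise == 'all' and taxid not in branches:
        branches[taxid] = {'taxid': taxid}  # create a new compartment for the hit
    for rank in reversed(lineage):  # walk from the hit back towards the root
        if rank in branches:
            return rank
    return None  # no compartment along the lineage


lineage = [1, 131567, 2759, 9606]  # abbreviated lineage of Homo sapiens (9606)
backbone = {2759: {'taxid': 2759, 'taxonomy': 'Eukaryota'},
            2: {'taxid': 2, 'taxonomy': 'Bacteria'}}

print(place_hit(9606, lineage, dict(backbone)))                   # 2759
print(place_hit(9606, lineage, dict(backbone), summarise='all'))  # 9606
```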

In [5]:
summarise='backbone'
# summarise='all'

samples=area_values['reads'].keys()

for sample in samples: ## iterate over samples
    for taxid in area_values['reads'][sample]: ## iterate over taxids
        mrca_path=ncbi.get_lineage(taxid) ## fetch lineage of taxid
        proceed=True ## assume we can proceed
        if 7157 in mrca_path: ## contig assigned to mosquitoes, we do not want it
            proceed=False

        if summarise=='all': ## working in "all" mode, every taxid amongst BLAST hits needs to be added to the treemap
            if taxid not in branches: ## taxid not seen before
                branches[taxid]={'attrs':{sample:{'contig_count':0,'read_count':0} for sample in samples}} ## create empty branch
                branches[taxid]['attrs']['colour']='slategrey' ## give it default colour
                branches[taxid]['taxid']=taxid ## annotate taxid
                branches[taxid]['taxonomy']=ncbi.get_taxid_translator([taxid])[taxid] ## annotate taxonomy
            
        for rank in mrca_path[::-1]: ## iterate over lineage of taxid in reverse order (youngest -> oldest)
            if proceed==True and rank in branches: ## given green light before and rank is present amongst those designated for treemap
                branch=branches[rank] ## fetch branch from treemap
                
                if sample not in branch['attrs']: ## sample not seen before, start contig and read counters
                    branch['attrs'][sample]={'contig_count':0,'read_count':0}
                branch['attrs'][sample]['read_count']+=area_values['reads'][sample][taxid] ## add to read_count
#                 branch['attrs'][sample]['contig_count']+=area_values['contigs'][sample][taxid]
                break ## no need to iterate further, break loop

reads_missed=0 ## counter for how many reads won't be in the treemap as a result of failure to get taxids with the current version of NCBI taxonomy from previous step
for sample in lost_contigs: ## iterate over samples
    for entry in lost_contigs[sample]: ## iterate over lines that failed
        contig_id=entry[0] ## fetch contig ID
        taxid=entry[-1] ## fetch taxid
        reads=contig_reads[sample].get(contig_id,0) ## reads mapped to this contig (0 if the contig only had an NR hit and no reads were remapped to it)
        if reads>2: ## can filter by how many missing contigs will be highlighted
            print(entry,reads)
        reads_missed+=reads ## increment counter
        
print('Reads missed: %d'%(reads_missed)) ## report

['NODE_5620_length_239_cov_4.500000', 'nt', 'CMS001_042_Ra_S23', '89.617', '183', '17', '2', '51', '232', '4355', '4174', '231.0', '1589719'] 10
['NODE_6070_length_240_cov_0.895706', 'nr', 'CMS001_058_Ra_S9', '62.86', '35', '13', '0', '107', '3', '6', '40', '48.1', '2528593'] 6
['NODE_2624_length_306_cov_1.275109', 'nt', 'CMS001_058_Ra_S9', '92.308', '91', '4', '3', '110', '198', '86356', '86445', '126.0', '2495405'] 4
['NODE_7781_length_404_cov_0.990826', 'nt', 'CMS001_009_Ra_S13', '94.118', '34', '2', '0', '123', '156', '4052175', '4052142', '52.8', '2507935'] 18
['NODE_281_length_889_cov_0.871921', 'nr', 'CMS001_014_Ra_S5', '71.15', '52', '12', '1', '732', '887', '1', '49', '51.2', '2511984'] 22
['NODE_287_length_455_cov_1.029101', 'nr', 'CMS001_015_Ra_S13', '39.55', '134', '63', '6', '108', '455', '1515', '1648', '49.7', '2211644'] 8
['NODE_250_length_1707_cov_0.794479', 'nr', 'CMS001_060_Ra_S12', '67.14', '70', '22', '1', '783', '577', '2', '71', '64.3', '2528593'] 30
['NODE_11730

## Step 6: build tree structure, output to file
-------

All the branches have been annotated up to this stage and now it's time to build the tree data structure. We start by summarising the contig and read counts across samples. We then specify colours for particular compartments and their descendants (_e.g._ plants and their descendants obviously have to be green). We then run the same procedure as before: we fetch branch A, retrieve its lineage, look for taxids along that lineage which exist as a branch (let's call it B), and finally assign branch B as the parent of branch A.
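
Colour inheritance can be sketched as picking the colour of the most deeply nested coloured rank in a taxon's lineage. This is a simplified formulation under assumptions: it walks the lineage in reverse instead of sorting the coloured ranks by depth as the cell below does, the lineages are abbreviated by hand, and the function name `pick_colour` is hypothetical.

```python
def pick_colour(lineage, lineage_colours, default='slategrey'):
    """Return the colour of the most specific coloured rank in a lineage."""
    for rank in reversed(lineage):  # most specific rank first
        if rank in lineage_colours:
            return lineage_colours[rank]
    return default  # nothing along the lineage has a designated colour


colours = {33090: '#01796F',  # Viridiplantae
           2759: '#0087BD',   # Eukaryota
           10239: '#A45A52'}  # Viruses

cotton = [1, 131567, 2759, 33090, 3635]  # abbreviated lineage of Gossypium hirsutum
human = [1, 131567, 2759, 9606]          # abbreviated lineage of Homo sapiens

print(pick_colour(cotton, colours))  # #01796F
print(pick_colour(human, colours))   # #0087BD
print(pick_colour([1], colours))     # slategrey
```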

There's an additional redirection step during this process which captures viruses belonging to the Shi 2016 "unclassified" virus category on NCBI. This is a basket rank resulting from a single study, and viruses within it have good phylogenetic affinities to existing groups but have not been transferred to them by ICTV. They are redirected manually by providing a new effective taxid, _e.g._ taxid `1923335` (Hubei virga-like virus 2), whose lineage would contain "unclassified RNA viruses ShiM-2016", is instead redirected to `675071` (_Virgaviridae_). The resulting tree-like JSON is written to file.
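
The redirect amounts to swapping a taxon's lineage for that of its adopted parent and appending the taxon itself. A minimal sketch, assuming abbreviated hand-written lineages in place of `ncbi.get_lineage()` and a hypothetical function name `effective_lineage`:

```python
SHI = 1922348  # "unclassified RNA viruses ShiM-2016"

# Abbreviated stand-in lineages (root first).
LINEAGES = {1923335: [1, 10239, 1922348, 1923335],  # Hubei virga-like virus 2
            675071: [1, 10239, 35278, 675071]}      # Virgaviridae
shi_redirects = {1923335: 675071}


def effective_lineage(taxid, lineages, redirects):
    """Return the lineage used for colouring and parent assignment."""
    lineage = list(lineages[taxid])
    if SHI in lineage and taxid != SHI:
        # Adopt the redirect target's lineage, then re-append the taxon itself.
        lineage = list(lineages[redirects[taxid]]) + [taxid]
    return lineage


print(effective_lineage(1923335, LINEAGES, shi_redirects))
# [1, 10239, 35278, 675071, 1923335]
```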

In [6]:
taxa_tree={'children':[branches['no_hit']],'attrs':{'colour':'slategrey'}} ## tree structure (+ add no-hit branch from the beginning)

sorted_taxids=sorted(branches,key=lambda k: k!='no_hit' and len(ncbi.get_lineage(k))) ## sort taxids by nesting level (higher nesting means more taxids are passed on the way to root); 'no_hit' sorts first

for b in sorted_taxids: ## iterate through flat list of branches
    if branches[b]['taxid']!='no_hit':
        reads=[branches[b]['attrs'][c]['read_count'] for c in branches[b]['attrs'] if 'CMS' in c]
        contigs=[branches[b]['attrs'][c]['contig_count'] for c in branches[b]['attrs'] if 'CMS' in c]
        branches[b]['attrs']['read_count']=sum(reads)
        branches[b]['attrs']['contig_count']=sum(contigs) ## compute sum of contig counts across samples  
        
lineage_colours={203691: '#9281A3', ## 203691 ## Spirochaetes
                 1236: '#9370DB', ## medium purple 1236 ## Gammaproteobacteria
                 1783272: '#86608E', ## pomp and power 1783272 ## Terrabacteria group
                 28211: '#7851A9', ## royal purple 28211 ## Alphaproteobacteria
                 8782: '#95C8D8', ## sky blue 8782 ## Aves
                 9347: '#4C516D', ## independence 9347 ## Eutheria
                 3654: '#008081', ## teal 3654 ## Trypanosomatidae
                 4751: '#224C98', ## polynesian blue 4751 ## Fungi
                 5654: '#0093AF', ## munsell blue 5654 ## Trypanosomatidae
                 33090: '#01796F', ## pine 33090 ## Viridiplantae
                 2497569: '#C25F55', ## chilli red, Negarnaviricota
                 35301: '#C25F55', ## chilli red (desaturated to 56) 35301 ## -ssRNA viruses
                 35278: '#EE6785', ## crayola (desaturated to 56)) 35278 ## +ssRNA viruses
                 1922348: '#8D3D4E', ## burgundy (desaturated to 56)) 1922348 ## ShiM-2016 viruses
                 2: '#D8BFD8', ## thistle ## Bacteria
                 2759: '#0087BD', ## ncs blue Eukaryota
                 10239: '#A45A52'} ## redwood Viruses
                ## colours assigned to each compartment of treemap

remove_colours=[]
for taxid in lineage_colours:
    try:
        ncbi.get_lineage(taxid)
    except ValueError:
        remove_colours.append(taxid)
        print('Removing taxid %s because it is not in the taxdump.tar.gz file'%(taxid))
        
for taxid in remove_colours:
    lineage_colours.pop(taxid)
    
shi_lineages={1923025: 11012,
              1923590: 222556,
              1923335: 675071,
              1922928: 39738,
              1922926: 2169577,
              1923660: 2169577,
              1923027: 11012,
              1923298: 11006,
              1923029: 11012,
              1923320: 11006,
              1923664: 39738,
              1923725: 39738,
              1923182: 10880,
              1922855: 249310,
              1922835: 699189,
              1922878: 11012,
              1923251: 10880,
              1923736: 232795,
              1923465: 249310,
              1922777: 186766,
              1923185: 11270,
              1923659: 2169577,
              1923698: 699189} ## taxid redirects for unclassified Shi viruses

# 11012 partiti
# 143920 unclassified Noda
# 675071 virga
# 39738 tombus
# 2169577 solemo
# 39738 tombus
# 699189 ifla
# 10880 reo
# 232795 dicistro
# 249310 chryso

iter_order=sorted(lineage_colours,key=lambda k: -len(ncbi.get_lineage(k))) ## most deeply nested ranks first
    
for taxid in sorted_taxids: ## iterate over every taxon in treemap
    
    taxon=branches[taxid] ## fetch branch
    
    if taxid!='no_hit': ## branch is not the "no hit" category
        lineage=ncbi.get_lineage(taxid) ## get its lineage
        if 1922348 in lineage and taxid!=1922348: ## Shi unclassified virus
            print('Shi virus:',taxid,taxon['taxonomy'])
            lineage=ncbi.get_lineage(shi_lineages[taxid]) ## order returned by get_lineage()
            lineage.append(taxid)
            
        for rank in iter_order: ## iterate through designated taxids, starting with most recent
            if rank in lineage: ## designated rank within lineage of taxon
                taxon['attrs']['colour']=lineage_colours[rank] ## assign colour
                break
        
        if len(lineage)==1 and taxon not in taxa_tree['children']: ## if root
            taxa_tree['children'].append(taxon) ## add root to tree
        
        for lin in lineage[::-1][1:]: ## iterate through lineage, starting from most recent, ignore first entry (self)
            if lin in branches: ## rank present amongst branches
                parent=branches[lin] ## grab parent
                assert lin==parent['taxid'],'lineage with incorrect taxid %s %s'%(parent['taxid'],lin)
                if 'children' not in parent: ## if parent doesn't have children yet - add the attribute
                    parent['children']=[]

                if taxon not in parent['children']: ## branch wasn't assigned to its parent yet
                    parent['children'].append(taxon) ## add child to parent
                
                assert taxon['taxid']!=parent['taxid'], 'parent is child %s %s, lineage: %s'%(taxon['taxid'],taxon['taxonomy'],lineage)
                
                break ## break for loop, parent has been identified

output_path='/Users/evogytis/Documents/manuscripts/skeeters/treemap'
with open(os.path.join(output_path,'skeeters.json'),'w') as handle:
    json.dump(taxa_tree,handle,indent=1) ## write json out to repo


Removing taxid 2497569 because it is not in the taxdump.tar.gz file
Shi virus: 1923335 Hubei virga-like virus 2
Shi virus: 1922926 Hubei mosquito virus 2
Shi virus: 1923660 Wenzhou sobemo-like virus 4
Shi virus: 1922928 Hubei mosquito virus 4
Shi virus: 1923590 Wenzhou noda-like virus 6
