# NCBI Taxonomy
**[Work in progress]**

This notebook downloads the NCBI taxonomy, including the taxonomy ID, scientific name, and synonyms for each taxon.

Bacteria, Invertebrates, Phages, Plants and Fungi, and Synthetic and Chimeric are currently excluded.

Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7408187/

Data source: [NCBI](https://www.ncbi.nlm.nih.gov)

Author: Peter Rose (pwrose@ucsd.edu)

In [1]:
import os
import pandas as pd
from pathlib import Path
from functools import reduce

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
# Path will take care of handling operating system differences.
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
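
`os.getenv` returns `None` when the variable is not set, and `Path(None)` then fails with a `TypeError` that does not name the cause. The following is a minimal sketch of a guard, assuming a hypothetical helper named `resolve_import_dir` that is not part of the original notebook:

```python
import os
from pathlib import Path


def resolve_import_dir(env_var: str = "NEO4J_IMPORT") -> Path:
    """Return the directory named by env_var, or raise a descriptive error.

    Hypothetical helper for illustration; the notebook reads the variable directly.
    """
    value = os.getenv(env_var)
    if not value:
        # Fail early with a message that names the missing variable.
        raise EnvironmentError(f"Environment variable {env_var} is not set")
    return Path(value)
```

With such a helper, the cell above could read `NEO4J_IMPORT = resolve_import_dir()`.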


In [4]:
!../../scripts/download.sh

Logging to:  /Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import/logs/2020-12-13


### Import NCBI Taxonomy Names

In [5]:
columns = ['id', 'name', 'nameCategory']

In [6]:
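# Fields in names.dmp are separated by "<tab>|<tab>"; a raw string keeps the regex escapes intact.
names = pd.read_csv(NEO4J_IMPORT / 'cache/ncbi_taxonomy' / 'names.dmp', sep=r'\t\|\t', engine='python',
                    usecols=[0,1,3], names=columns, header=None, dtype='str')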

In [7]:
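# Strip the trailing "<tab>|" line terminator; match it literally rather than as a regex.
names['nameCategory'] = names['nameCategory'].str.replace('\t|', '', regex=False)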
names.fillna('', inplace=True)
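
To make the parsing step easier to follow, here is a small self-contained sketch that applies the same separator and cleanup to a few hand-written rows that mimic the `names.dmp` layout (fields separated by tab-pipe-tab, each line ending in tab-pipe). The sample rows and the `sample` variable are illustrative assumptions, not data taken from the download:

```python
import io

import pandas as pd

# Hand-written rows in the names.dmp layout: id, name, unique name, name category.
sample = (
    "1\t|\tall\t|\t\t|\tsynonym\t|\n"
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
)

names = pd.read_csv(io.StringIO(sample), sep=r"\t\|\t", engine="python",
                    usecols=[0, 1, 3], names=["id", "name", "nameCategory"],
                    header=None, dtype="str")

# The last field still carries the "<tab>|" line terminator; remove it literally.
names["nameCategory"] = names["nameCategory"].str.replace("\t|", "", regex=False)
print(names)
```

Passing `regex=False` matters because `|` is the alternation operator in a regular expression, so the terminator is safer to match as a literal string.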

In [8]:
print('Number of taxonomyIds:', len(names['id'].unique()))

Number of taxonomyIds: 2295525


In [9]:
print('Name categories:', names['nameCategory'].unique())

Name categories: ['synonym' 'scientific name' 'blast name' 'genbank common name' 'in-part'
 'authority' 'type material' 'equivalent name' 'includes' 'common name'
 'genbank synonym' 'acronym' 'genbank acronym']


In [10]:
sci_name = names.query("nameCategory == 'scientific name'").copy()
sci_name.rename(columns={'name': 'scientificName'}, inplace=True)
sci_name = sci_name[['id', 'scientificName']]

In [11]:
print('scientific names:', sci_name.shape[0])

scientific names: 2295525


In [12]:
names1 = names.merge(sci_name, on='id', how='left')

In [13]:
names1.head()

Unnamed: 0,id,name,nameCategory,scientificName
0,1,all,synonym,root
1,1,root,scientific name,root
2,2,Bacteria,scientific name,Bacteria
3,2,bacteria,blast name,Bacteria
4,2,eubacteria,genbank common name,Bacteria


In [14]:
names2 = names1.groupby(['id', 'scientificName'])['name'].apply(list).reset_index(name='synonymes')

In [15]:
names2.head()

Unnamed: 0,id,scientificName,synonymes
0,1,root,"[all, root]"
1,10,Cellvibrio,[Cellvibrio (ex Winogradsky 1929) Blackall et ...
2,100,Ancylobacter aquaticus,[Ancylobacter aquaticus (Orskov 1928) Raj 1983...
3,100000,Herbaspirillum sp. BA12,"[Herbaspirillum sp. BA12, Herbispirillum sp. B..."
4,1000000,Microbacterium sp. 6.11-VPa,[Microbacterium sp. 6.11-VPa]


In [16]:
names2['name'] = names2['scientificName']

In [17]:
names2['id'] = 'taxonomy:' + names2['id']

In [18]:
names2['synonymes'] = names2['synonymes'].apply(lambda x: ';'.join(x))

In [19]:
names2.head(10)

Unnamed: 0,id,scientificName,synonymes,name
0,taxonomy:1,root,all;root,root
1,taxonomy:10,Cellvibrio,Cellvibrio (ex Winogradsky 1929) Blackall et a...,Cellvibrio
2,taxonomy:100,Ancylobacter aquaticus,Ancylobacter aquaticus (Orskov 1928) Raj 1983;...,Ancylobacter aquaticus
3,taxonomy:100000,Herbaspirillum sp. BA12,Herbaspirillum sp. BA12;Herbispirillum sp. BA12,Herbaspirillum sp. BA12
4,taxonomy:1000000,Microbacterium sp. 6.11-VPa,Microbacterium sp. 6.11-VPa,Microbacterium sp. 6.11-VPa
5,taxonomy:1000001,Mycobacterium sp. 1.1-VEs,Mycobacterium sp. 1.1-VEs,Mycobacterium sp. 1.1-VEs
6,taxonomy:1000002,Mycobacterium sp. 1.12-VEs,Mycobacterium sp. 1.12-VEs,Mycobacterium sp. 1.12-VEs
7,taxonomy:1000003,Nocardia sp. 3.2-VPr,Nocardia sp. 3.2-VPr,Nocardia sp. 3.2-VPr
8,taxonomy:1000004,Polaromonas sp. 7.23-VPa,Polaromonas sp. 7.23-VPa,Polaromonas sp. 7.23-VPa
9,taxonomy:1000005,Promicromonospora sp. 10.25-Bb,Promicromonospora sp. 10.25-Bb,Promicromonospora sp. 10.25-Bb
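
The aggregation above collects every name recorded for a taxon, including the scientific name itself, into one semicolon-separated string. A minimal sketch of the same pattern on a toy table (the four rows below are made up for illustration) might look like this:

```python
import pandas as pd

# Toy stand-in for the parsed names table.
names = pd.DataFrame({
    "id": ["1", "1", "2", "2"],
    "name": ["all", "root", "Bacteria", "eubacteria"],
    "nameCategory": ["synonym", "scientific name",
                     "scientific name", "genbank common name"],
})

# One row per taxon with its scientific name.
sci = (names[names["nameCategory"] == "scientific name"]
       .rename(columns={"name": "scientificName"})[["id", "scientificName"]])

# Attach the scientific name to every name row, then join all names per taxon.
merged = names.merge(sci, on="id", how="left")
grouped = (merged.groupby(["id", "scientificName"])["name"]
           .apply(";".join)
           .reset_index(name="synonymes"))
print(grouped)
```

Joining directly with `";".join` skips the intermediate list column; the notebook's two-step version (build a list, then join) produces the same strings.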


### Import NCBI Taxonomy Nodes

In [20]:
node_columns = ['id', 'parentId', 'rank', 'divisionId']

In [21]:
nodes = pd.read_csv(NEO4J_IMPORT / 'ncbi_taxonomy' / 'nodes.dmp', sep=r'\t\|\t', engine='python',
                    usecols=[0,1,2,4], names=node_columns, header=None, dtype='str')

In [22]:
print('Number of relationships:', nodes.shape[0])

Number of relationships: 2295525


In [23]:
nodes.head()

Unnamed: 0,id,parentId,rank,divisionId
0,1,1,no rank,8
1,2,131567,superkingdom,0
2,6,335928,genus,0
3,7,6,species,0
4,9,32199,species,0


In [24]:
division_columns = ['divisionId', 'division']

In [25]:
divisions = pd.read_csv(NEO4J_IMPORT / 'ncbi_taxonomy' / 'division.dmp', sep=r'\t\|\t', engine='python',
                        usecols=[0,2], names=division_columns, header=None, dtype='str')

In [26]:
divisions.head(20)

Unnamed: 0,divisionId,division
0,0,Bacteria
1,1,Invertebrates
2,2,Mammals
3,3,Phages
4,4,Plants and Fungi
5,5,Primates
6,6,Rodents
7,7,Synthetic and Chimeric
8,8,Unassigned
9,9,Viruses


In [27]:
nodes = nodes.merge(divisions, on='divisionId', how='left')

In [28]:
nodes.shape

(2295525, 5)

In [29]:
nodes.head()

Unnamed: 0,id,parentId,rank,divisionId,division
0,1,1,no rank,8,Unassigned
1,2,131567,superkingdom,0,Bacteria
2,6,335928,genus,0,Bacteria
3,7,6,species,0,Bacteria
4,9,32199,species,0,Bacteria


##### Restrict taxonomies
Keep only nodes whose division is Mammals (2), Primates (5), Rodents (6), Unassigned (8), Viruses (9), or division 10, plus the node for unclassified environmental samples (taxonomy ID 151659).

In [30]:
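# Copy the filtered frame so the column assignments below do not act on a view of the original.
nodes = nodes[nodes['divisionId'].isin(['2','5','6','8','9','10']) | (nodes['id'] == '151659')].copy()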

In [31]:
nodes['id'] = 'taxonomy:' + nodes['id']
nodes['parentId'] = 'taxonomy:' + nodes['parentId']

In [32]:
nodes.shape

(294338, 5)
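
The restriction combines two conditions: membership in a list of division IDs, or an exact match on one taxonomy ID. A short sketch on a toy table shows how the boolean mask behaves; the `divisionId` values in these rows are invented for illustration and are not taken from `nodes.dmp`:

```python
import pandas as pd

# Toy nodes table; division IDs are illustrative only.
nodes = pd.DataFrame({
    "id": ["2", "9606", "151659", "10090"],
    "divisionId": ["0", "5", "0", "6"],
})

keep_divisions = ["2", "5", "6", "8", "9", "10"]

# A row is kept if its division is listed OR its id is the one explicit exception.
mask = nodes["divisionId"].isin(keep_divisions) | (nodes["id"] == "151659")

# .copy() gives an independent frame, so later assignments raise no chained-assignment warning.
subset = nodes[mask].copy()
subset["id"] = "taxonomy:" + subset["id"]
print(subset)
```

The parentheses around `nodes["id"] == "151659"` are required because `|` binds more tightly than `==` in Python.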

In [33]:
nodes = nodes.merge(names2, on='id')

In [34]:
nodes.to_csv(NEO4J_IMPORT / '00b-NCBITaxonomy.csv', index=False)

In [35]:
print('Number of nodes', nodes.shape[0])

Number of nodes 294338
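
The node count is the same before and after the merge with the names table (294338), which suggests every retained node found a matching name. Because `merge` defaults to an inner join, unmatched nodes would otherwise be dropped silently. One way to check this explicitly is sketched below on toy data, using the `indicator` and `validate` options of `DataFrame.merge`; the frames and the `missing` variable are assumptions for illustration:

```python
import pandas as pd

# Toy stand-ins for the filtered nodes and the aggregated names.
nodes = pd.DataFrame({"id": ["taxonomy:1", "taxonomy:7", "taxonomy:9"]})
names2 = pd.DataFrame({"id": ["taxonomy:1", "taxonomy:7"],
                       "name": ["root", "example name"]})

# A left join keeps every node; indicator=True records where each row came from,
# and validate raises if an id appears more than once on either side.
check = nodes.merge(names2, on="id", how="left",
                    indicator=True, validate="one_to_one")

# Nodes without a matching name are flagged as "left_only".
missing = check.loc[check["_merge"] == "left_only", "id"].tolist()
print("Nodes without names:", missing)
```

An empty `missing` list on the real data would confirm that the inner join in the notebook loses no nodes.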


In [36]:
nodes.head()

Unnamed: 0,id,parentId,rank,divisionId,division,scientificName,synonymes,name
0,taxonomy:1,taxonomy:1,no rank,8,Unassigned,root,all;root,root
1,taxonomy:2387,taxonomy:28384,no rank,8,Unassigned,transposons,transposons;Transposon,transposons
2,taxonomy:2388,taxonomy:2387,species,8,Unassigned,Transposon Tn-cam204,Transposon Tn-cam204,Transposon Tn-cam204
3,taxonomy:2389,taxonomy:2387,species,8,Unassigned,Transposon Tn10,Transposon Tn10,Transposon Tn10
4,taxonomy:2390,taxonomy:2387,species,8,Unassigned,Transposon Tn1331,Transposon Tn1331,Transposon Tn1331
