## getTaxa

This script fetches the genera, species, or subspecies that belong to each taxon in a user-supplied list of higher-level taxa (phylum, class, order, family, or subfamily). The output can then be used as input for getMito.

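### Usage

The script is interactive and reads two values from standard input, in this order: the input file name (with extension, e.g. `input.txt`) and a prefix for the output files. It expects the reference file `eukaryota.tsv` (see "Prepare NCBI lineage data" below) in the working directory and exits with an error if either output file already exists.
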
### Input file
A tab-separated file with three fields per line: input taxon name, input taxonomic level, and output taxonomic level. For example:
```
Bathylagidae	family	species
Onychoteuthidae	family	species
Anguilliformes	order	species
Astronesthes	order	subspecies
Melamphaes	order	subspecies
```
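
As a rough illustration of how these rows can be read, the sketch below parses a query file into `(taxon, input level, output level)` tuples and rejects unknown levels. The function name `parse_queries` and the strict error handling are assumptions made for this example; the accepted levels simply mirror the ones the script checks for.

```python
import io

# Levels mirrored from the checks in the getTaxa script. Treating anything
# else as an error is an assumption of this sketch, not the script's behaviour.
INPUT_LEVELS = {"phylum", "class", "order", "family", "subfamily"}
OUTPUT_LEVELS = {"genus", "species", "subspecies"}


def parse_queries(handle):
    """Yield (taxon, input_level, output_level) from a tab-separated handle."""
    for lineno, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError("line %d: expected 3 tab-separated fields" % lineno)
        taxon, in_level, out_level = (field.strip() for field in fields)
        in_level, out_level = in_level.casefold(), out_level.casefold()
        if in_level not in INPUT_LEVELS or out_level not in OUTPUT_LEVELS:
            raise ValueError("line %d: unrecognised taxonomic level" % lineno)
        yield taxon, in_level, out_level


# Two rows taken from the example input above.
example = io.StringIO("Bathylagidae\tfamily\tspecies\nAnguilliformes\torder\tspecies\n")
queries = list(parse_queries(example))
print(queries)
```

Validating up front like this catches a misspelled level before any lookup runs.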

### Output file
Two output files are produced, where `xxx` is the output prefix:
1. `xxx_taxa.tsv` is a tab-separated file with the following fields: input taxon name, input taxonomic level, output taxonomic name, output taxonomic level. It contains one row per hit.
2. `xxx_taxa.txt` is a clean file containing only the output taxonomic names, one per line and without duplicates. This file is compatible with getMito.
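
The `.txt` file lists each name only once even when several queries return it. A minimal sketch of that deduplication, assuming the same "seen set" approach the script uses, is shown below; the helper name `unique_names` and the sample hits are purely illustrative.

```python
def unique_names(hits):
    """Return non-empty names in first-seen order, without duplicates."""
    seen = {""}  # pre-seeding with "" makes empty hits count as already seen
    ordered = []
    for hit in hits:
        if hit not in seen:
            seen.add(hit)
            ordered.append(hit)
    return ordered


# Illustrative hits only: one empty value and one repeated name.
hits = ["Bathylagus euryops", "", "Bathylagus euryops", "Leuroglossus stilbius"]
names = unique_names(hits)
print("\n".join(names))
```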


### Prepare NCBI lineage data

NCBI lineage data is prepared using the [ncbitax2lin tool](https://github.com/zyxue/ncbitax2lin).

#### Download NCBI taxdump

```
cd NCBI
wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar zxf taxdump.tar.gz 
```

#### Run ncbitax2lin 

```
ncbitax2lin nodes.dmp names.dmp
gunzip ncbi_lineages_2020-05-28.csv
```

#### Extract only subspecies to phylum information from Eukaryota records and reformat output as tab-separated tsv file

```
grep -e tax_id -e Eukaryota ncbi_lineages_2020-05-28.csv | awk -F "," 'BEGIN{OFS="\t"}{print $3,$4,$5,$6,$7,$8,$51}' > eukaryota.tsv
```
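
Splitting on every comma can misalign columns if a name in the CSV is quoted and itself contains a comma. As a hedged alternative, the sketch below does a similar extraction with Python's `csv` module and selects columns by header name instead of position. The column names (`superkingdom`, `phylum`, and so on) are assumptions about the ncbitax2lin header and should be checked against your own file. Unlike the `grep` above, it keeps only rows whose `superkingdom` field is exactly `Eukaryota`. The helper name `extract_eukaryota` and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed header names; verify them against the first line of your CSV.
COLUMNS = ["phylum", "class", "order", "family", "genus", "species"]


def extract_eukaryota(csv_handle, tsv_handle, columns=COLUMNS):
    """Copy the selected columns of Eukaryota rows to a tab-separated handle."""
    reader = csv.DictReader(csv_handle)
    tsv_handle.write("\t".join(columns) + "\n")  # header row
    kept = 0
    for row in reader:
        if row.get("superkingdom") == "Eukaryota":
            # Missing or empty cells become empty strings
            tsv_handle.write("\t".join(row.get(col) or "" for col in columns) + "\n")
            kept += 1
    return kept


# Tiny in-memory stand-in for the lineage CSV (tax_id values are invented).
source = io.StringIO(
    "tax_id,superkingdom,phylum,class,order,family,genus,species\n"
    '1,Eukaryota,Chordata,Actinopteri,Anguilliformes,Anguillidae,Anguilla,"Anguilla anguilla"\n'
    "2,Bacteria,Pseudomonadota,,,,,\n"
)
target = io.StringIO()
kept = extract_eukaryota(source, target)
print(target.getvalue())
```

For a real run, open the CSV with `newline=""` as the `csv` documentation recommends, and extend `columns` with whichever additional rank columns you need.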


```
# This script fetches genus, species, or subspecies
# belonging to a specified list of taxonomic categories
# (phylum, class, order, family, or subfamily).
# The *.txt output can then be used for getMito.py

# This script is interactive and takes 2 inputs, in the following order:
# 1) Input file name with extension (e.g. input.txt)
# 2) Output file prefix for output files: <prefix>_taxa.tsv, <prefix>_taxa.txt

import sys
from os import path

# Get the 2 inputs from user
input_file = input()
output_prefix = input()

# Read the queries, skipping blank lines
with open(input_file, 'r') as f:
    filein = [line.rstrip("\r\n") for line in f if line.strip()]

reference_file = 'eukaryota.tsv'

# Taxonomic levels, indexed by their column position in the reference file
taxlist = ["phylum", "class", "order", "family", "genus", "species", "subfamily", "subspecies"]

# Accepted input and output levels, mapped to their column index
inlevels = {"phylum": 0, "class": 1, "order": 2, "family": 3, "subfamily": 6}
outlevels = {"genus": 4, "species": 5, "subspecies": 7}

# Throw an error message and exit if output file(s) already exist
tsv = output_prefix + "_taxa.tsv"
txt = output_prefix + "_taxa.txt"

if path.exists(tsv) or path.exists(txt):
    sys.exit("Error: Output file exists! Please rename output file and try again!")

# For deduplication of txt output ("" is pre-seeded so empty hits are never written)
seen = set()
seen.add("")


# This function looks up input taxa category and writes the output taxonomic names
def lookup(name, level, searchlevel, qcount):
    count = 0
    with open(reference_file, 'r') as ref, open(tsv, 'a') as output1, open(txt, 'a') as output2:
        for rline in ref:
            # Strip the newline before splitting so the last column compares cleanly
            linelist = rline.rstrip("\r\n").split("\t")

            if name.casefold() == linelist[level].casefold():
                hit = linelist[searchlevel]

                if hit != "":
                    output1.write("%s\t%s\t%s\t%s\n" % (name.capitalize(), taxlist[level], hit, taxlist[searchlevel]))
                    count += 1

                if hit not in seen:
                    seen.add(hit)
                    output2.write("%s\n" % hit)

    print("Query #%d:%s\tQuery level:%s\tSearch level:%s\t# hits:%d" % (qcount, name.capitalize(), taxlist[level], taxlist[searchlevel], count))


for i, query in enumerate(filein):
    # Split input string into input taxa name, input taxonomic level, output taxonomic level
    fields = query.split("\t")
    if len(fields) < 3:
        print("Query #%d skipped: expected 3 tab-separated fields" % (i + 1))
        continue

    intaxa = fields[0].strip()
    inlevel = inlevels.get(fields[1].strip().casefold())
    outlevel = outlevels.get(fields[2].strip().casefold())

    # Skip unrecognised levels instead of silently defaulting to phylum
    if inlevel is None or outlevel is None:
        print("Query #%d:%s skipped: unrecognised taxonomic level" % (i + 1, intaxa))
        continue

    lookup(name=intaxa, level=inlevel, searchlevel=outlevel, qcount=i + 1)
```

Example run. The first two lines are the values entered for the input file name and the output prefix; the remaining lines are the per-query summary printed by the script:

```
taxin
taxout
Query #1:Bathylagidae	Query level:family	Search level:species	# hits:29
Query #2:Onychoteuthidae	Query level:family	Search level:species	# hits:45
Query #3:Anguilliformes	Query level:order	Search level:species	# hits:888
Query #4:Astronesthes	Query level:order	Search level:subspecies	# hits:0
Query #5:Melamphaes	Query level:order	Search level:subspecies	# hits:0
Query #6:Astronesthinae	Query level:subfamily	Search level:genus	# hits:54
Query #7:Taoniinae	Query level:subfamily	Search level:genus	# hits:0
```
