## getMito
- GitHub page: https://github.com/shenjean/getMito
- Wiki page: https://github.com/shenjean/getMito/wiki

In [7]:
# This python script creates a dictionary from the nt.list file
# The key is the NCBI accession number and the value is the gene description
# This dictionary is then saved as a pickle named nt.pickle

#!/usr/bin/python
import pickle
ntdict = {'Accession':'Gene description'}

with open("nt.list",'r') as f:
    for line in f:
        line = line.rstrip()
        entry = line.split("\t")
        # Drop the version suffix, e.g. LN610233.1 -> LN610233
        fullacc=entry[0].split(".")
        ntdict[fullacc[0]]=entry[1]

with open("nt.pickle", 'wb') as handle:
    pickle.dump(ntdict, handle, protocol=pickle.HIGHEST_PROTOCOL)
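
The parsing step can be isolated into a small helper, which makes the version-stripping rule easy to check on its own. This is a minimal sketch, not part of getMito: `parse_nt_line` is a hypothetical name, and it assumes each line of <b>nt.list</b> has the form accession.version, a tab, then the description. Lines without a tab are skipped instead of raising an `IndexError`.

```python
def parse_nt_line(line):
    """Return (accession, description) for one nt.list line, or None.

    Hypothetical helper; assumes 'LN610233.1<TAB>description' lines.
    """
    fields = line.rstrip("\n").split("\t", 1)
    if len(fields) < 2:
        return None  # malformed line: no tab-separated description
    # Keep only the part before the first dot, as the script above does
    accession = fields[0].split(".")[0]
    return accession, fields[1]


ntdict = {}
for raw in [
    "LN610233.1\tChiloglanis anoterus mitochondrial partial D-loop",
    "broken line",
]:
    parsed = parse_nt_line(raw)
    if parsed is not None:
        ntdict[parsed[0]] = parsed[1]
```

Unlike the script above, the sketch starts from an empty dictionary rather than a placeholder header entry.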



In [8]:
# This python script loads nt.pickle 
# Then, it matches MitoFish entries with keys (NCBI accession numbers) in the pickle
# It takes ~24 minutes to match 616,999 MitoFish records to 57,377,397 NCBI GenBank records
# cpupercent=101,cput=00:22:53,mem=34500088kb,ncpus=24,vmem=34845840kb,walltime=00:23:34

#!/usr/bin/python

import pickle

import sys
from os import path

with open("nt.pickle", 'rb') as handle:
    NCBI = pickle.load(handle)

outfile="mitofish.genes"
nohit=0
count=0
length=(len(NCBI))

if path.exists(outfile) :
    sys.exit("Error: Output file exists! Please rename output file and try again!")

output=open(outfile,'a')

print("==== Searching... ====")

with open("mitofish.accession",'r') as infile:
    for inline in infile:
        inline = inline.rstrip()
        count +=1
        if inline in NCBI:
            output.write("%s\t%s\n" % (inline,NCBI[inline]))
        else:
            output.write("%s\tNo hit found!\n" % inline)
            print("%s\tNo hit found!" % inline)
            nohit +=1
            
output.close()

print ("==== Run complete! ===")
print ("Total: %d accession numbers" % count)
print ("No hits for %d input accession numbers!" % nohit)


==== Searching... ====
LN610214	No hit found!
LN610215	No hit found!
LN610217	No hit found!
LN610218	No hit found!
LN610219	No hit found!
LN610241	No hit found!
LN610242	No hit found!
LN610243	No hit found!
LN610244	No hit found!
==== Run complete! ===
Total: 16 accession numbers
No hits for 9 input accession numbers!


### Pick up "missing" accession numbers 

In NCBI FASTA files, records with identical sequences are merged into a single entry whose header line concatenates the individual headers, e.g.

```
>LN610233.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68321_34LN610234.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68330_55
```
The Python script above does not pick up the second and later accession numbers in such merged headers. First, let's extract the accession numbers with no hits into an output file named <b>nohit.accession</b>.
```
grep "No hit" mitofish.genes | awk -F "\t" '{print $1}' | sed "s/$/.[0-9]/g" >nohit.accession
```
The sed command appends the regular expression pattern .[0-9] to each accession number with no hit, so that the grep search only matches the accession number when it is followed by a version digit (e.g. LN610234.1). The output file <b>nohit.accession</b> looks like this:

```
LN610210.[0-9]
LN610216.[0-9]
LN610224.[0-9]
LN610229.[0-9]
LN610230.[0-9]
LN610231.[0-9]
LN610232.[0-9]
LN610234.[0-9]
LN610235.[0-9]
```
We will also extract the hits from the python search for later use. Output file is <b>hit.list</b>:
```
grep -v "No hit" mitofish.genes >hit.list
```
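
The two extraction steps above (no-hit patterns and hit list) can also be done in one pass in Python. This is a minimal sketch, assuming <b>mitofish.genes</b> has the two tab-separated columns written by the matching script. `split_hits` is a hypothetical helper name, and the first demo line is a placeholder rather than a real record.

```python
def split_hits(lines):
    """Split mitofish.genes lines into hit lines and no-hit grep patterns."""
    hits, nohit_patterns = [], []
    for line in lines:
        line = line.rstrip("\n")
        if "No hit" in line:
            accession = line.split("\t")[0]
            # Same suffix that the sed command appends
            nohit_patterns.append(accession + ".[0-9]")
        else:
            hits.append(line)
    return hits, nohit_patterns


hits, patterns = split_hits([
    "ACC000001\tplaceholder description\n",  # placeholder record
    "LN610214\tNo hit found!\n",
])
```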

Next, split up the reference file (<b>nt.genenames</b>, generated in the previous step) into smaller chunks of 1 million lines each. Each split file will have the prefix x followed by a number, e.g. x00, x01.

```
cd NCBI
split -d -l 1000000 nt.genenames
cd ..
```
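
Where GNU split is not available, the chunking can be sketched in Python. This is an illustration only: `split_file` and its parameters are hypothetical names, and the sketch reproduces the x00, x01, ... naming with a zero-padded two-digit suffix without attempting to mirror what split does beyond that range.

```python
import itertools
import os
import tempfile


def split_file(path, lines_per_chunk, out_dir):
    """Write consecutive chunks of `path` to out_dir/x00, x01, ... and return their paths."""
    chunk_paths = []
    with open(path, "r") as src:
        for index in itertools.count():
            # Read at most lines_per_chunk lines from the current position
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            chunk_path = os.path.join(out_dir, "x%02d" % index)
            with open(chunk_path, "w") as dst:
                dst.writelines(chunk)
            chunk_paths.append(chunk_path)
    return chunk_paths


# Small demonstration: a temporary 5-line file split into chunks of 2 lines
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "nt.genenames")
    with open(demo, "w") as fh:
        fh.write("".join("line %d\n" % i for i in range(5)))
    chunk_names = [os.path.basename(p) for p in split_file(demo, 2, tmp)]
```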

For each accession number with no hit, search each chunk of nt.genenames for matches. xargs -P 0 runs as many grep processes in parallel as possible, and -n 1 passes one file name to each grep invocation. This takes ~39 hours to run for 144,308 accession numbers.

```
while read -r id
do
string=$(ls NCBI/x* | xargs -n 1 -P 0 grep "$id")
echo "$id%$string" >>nohit.genes
done <nohit.accession
```
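
Because the loop above rescans every chunk once per accession number, an alternative is to read the reference a single time and look up each accession token in a set. The following is a sketch of that idea, not the method used here. `ACC_RE` is an assumed approximation of an accession.version token, `find_missing` is a hypothetical name, and the sketch presumes that merged headers keep every accession.version intact, as in the example header shown earlier.

```python
import re

# Assumed shape of an accession.version token, e.g. LN610234.1
ACC_RE = re.compile(r"([A-Z]{1,6}_?\d{5,9})\.\d+")


def find_missing(missing, reference_lines):
    """Map each missing accession to the first reference line that mentions it."""
    wanted = set(missing)
    found = {}
    for line in reference_lines:
        for match in ACC_RE.finditer(line):
            accession = match.group(1)  # accession without the version
            if accession in wanted and accession not in found:
                found[accession] = line.rstrip("\n")
    return found


header = (
    "LN610233.1 Chiloglanis anoterus mitochondrial partial D-loop, "
    "specimen voucher 68321_34"
    "LN610234.1 Chiloglanis anoterus mitochondrial partial D-loop, "
    "specimen voucher 68330_55"
)
found = find_missing(["LN610234", "LN610235"], [header])
```

In this demonstration only LN610234 is found, since the single header does not mention LN610235. On the real data the pattern would need checking against the accession formats that actually occur in nt.genenames.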
Output file is <b>nohit.genes</b>. First, let's clean up the output file by keeping only lines that contain a tab (accession numbers for which grep found a record) and removing the regular expression patterns following the accession numbers:

```
grep -P '\t' nohit.genes | sed "s/.\[0\-9\]//g" >nohit.genes.clean
```
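
The same clean-up can be sketched in Python, which makes the two rules explicit. This assumes each line of <b>nohit.genes</b> has the form written by the echo command, i.e. the pattern, a % sign, then whatever grep returned. `clean_nohit_genes` is a hypothetical name and the description in the demo is a placeholder.

```python
import re


def clean_nohit_genes(lines):
    """Keep lines containing a tab and strip the '.[0-9]' pattern suffix."""
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # grep returned nothing for this accession
        # Mirrors the sed expression: any character followed by a literal [0-9]
        cleaned.append(re.sub(r".\[0-9\]", "", line))
    return cleaned


cleaned = clean_nohit_genes([
    "LN610234.[0-9]%LN610233.1\tplaceholder description",
    "LN610210.[0-9]%",
])
```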

Now, combine <b>hit.list</b> (accession numbers with exact matches with NCBI accession numbers) with <b>nohit.genes.clean</b> (accession numbers with duplicated sequences). The output file will be <b>mitofish.hit.list</b>:
```
cat hit.list nohit.genes.clean >mitofish.hit.list
```


In [73]:
# From a user-provided list of genera/species/subspecies, this script extracts the corresponding GenBank accession numbers 
# and gene names of their 12S rRNA sequences or mitochondrial sequences, if available.

# This script is interactive and takes 3 inputs, in the following order:
# 1) Input file name with extension (e.g. input.txt)
# 2) Output file prefix for output files: <prefix>_genus.hits.tsv, <prefix>_species.hits.tsv, <prefix>_subspecies.hits.tsv
# 3) Reference database - either 12S.ref.tsv or mitofish.ref.tsv



import sys
from os import path

# Get the 3 inputs from user

input_file = input()
with open(input_file,'r') as infile:
    ref=tuple(infile)
output_prefix = input()
reference_file = input()



# Throw an error message and exit if output file(s) already exist

full_path=str(output_prefix+"_subspecies.hits.tsv")
species_path=str(output_prefix+"_species.hits.tsv")
genus_path=str(output_prefix+"_genus.hits.tsv")


if path.exists(full_path) or path.exists(species_path) or path.exists(genus_path) :
    sys.exit("Error: Output file exists! Please rename output file and try again!")
    
# This function performs matching at the specified level and writes results to the corresponding output file
# Output files are tab-separated with the following columns:
# Query, taxonomic level, GenBank accession number, gene description

def matchme(query,level):
    count=0
    outpath=str(output_prefix+"_"+level+".hits.tsv")
    # Append every reference line that contains the query string
    with open(outpath,'a') as output, open(reference_file, 'r') as f:
        for line in f:
            if query in line:
                count += 1
                output.write("%s\t%s\t%s" % (query,level,line))
    print("Query:%s\tLevel:%s\t# Hits:%d" % (query,level,count))

# The while loop below goes through the input file line by line 
i = 0
seen=set()

while (i < len(ref)):
    
    # Split string in query into genus,species, and subspecies (if present)
    taxa=str(ref[i]).rsplit()
    fulltaxa=str(ref[i]).rsplit("\n")
    fullquery=str(fulltaxa[0])
    gquery=str(taxa[0])
    
    # Check if species string exist in query
    if (len(taxa)>1):
        squery=str(taxa[0]+" "+taxa[1])
    
    qcount = i+1
    print ("=== Searching user query #%d ===" % qcount)

# These if statements determine the level of matching (subspecies/species/genus) for each UNIQUE query 
    
    if (fullquery==gquery):
        if fullquery not in seen:
            matchme(query=fullquery,level="genus")
            seen.add(fullquery)
        else:
            print("Duplicate query: Genus %s has already been processed." % fullquery)
        
    elif (fullquery==squery):
        if fullquery not in seen:
            matchme(query=fullquery,level="species")
            seen.add(fullquery)
        else:
            print("Duplicate query: Species %s has already been processed." % fullquery)
        if gquery not in seen:
            matchme(query=gquery,level="genus")
            seen.add(gquery)
        else:
            print("Duplicate query: Genus %s has already been processed." % gquery)
    
    else:
        if fullquery not in seen:
            matchme(query=fullquery,level="subspecies")
            seen.add(fullquery)
        else:
            print("Duplicate query: Subspecies %s has already been processed." % fullquery)
        if squery not in seen:
            matchme(query=squery,level="species")
            seen.add(squery)
        else:
            print("Duplicate query: Species %s has already been processed." % squery)
        if gquery not in seen:
            matchme(query=gquery,level="genus")
            seen.add(gquery)
        else:
            print("Duplicate query: Genus %s has already been processed." % gquery)

    i += 1

print ("==== Run complete! ===")

# Check and report on the types of output files generated 

if path.exists(genus_path): 
    print ("Accession numbers of genus hits and description saved in %s" % genus_path)
else:
    print("No genus detected in input file.")
    
if path.exists(species_path):
    print ("Accession numbers of species hits and description saved in %s" % species_path)
else:
    print("No species detected in input file.")
    
    
if path.exists(full_path):
    print ("Accession numbers of subspecies hits and description saved in %s" % full_path)
else:
    print("No subspecies detected in input file.")




subspecies
pysub
mitofish.ref.tsv
=== Searching user query #1 ===
Query:Histioteuthis celetaria celetaria	Level:subspecies	# Hits:0
Query:Histioteuthis celetaria	Level:species	# Hits:0
Query:Histioteuthis	Level:genus	# Hits:0
=== Searching user query #2 ===
Query:Histioteuthis corona corona	Level:subspecies	# Hits:0
Query:Histioteuthis corona	Level:species	# Hits:0
Duplicate query: Genus Histioteuthis has already been processed.
=== Searching user query #3 ===
Query:Stomias boa boa	Level:subspecies	# Hits:0
Query:Stomias boa	Level:species	# Hits:28
Query:Stomias	Level:genus	# Hits:86
=== Searching user query #4 ===
Query:Lampadena urophaos atlantica	Level:subspecies	# Hits:0
Query:Lampadena urophaos	Level:species	# Hits:13
Query:Lampadena	Level:genus	# Hits:63
=== Searching user query #5 ===
Query:Notoscopelus elongatus kroyeri	Level:subspecies	# Hits:3
Query:Notoscopelus elongatus	Level:species	# Hits:18
Query:Notoscopelus	Level:genus	# Hits:64
=== Searching user query #6 ===
Query:Sc