<a href="https://colab.research.google.com/github/biothomme/Linalool/blob/master/ncbi_lumberjack_og.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#NCBI Lumberjack OG
###Uppsala, Dec 2019
Getting your FASTA sequences out of the NCBI database GenBank can be very tiring... If GenBank were a jungle, this script would be the lumberjack: it gives you a great overview of the jungle and gets you your data - fast and fancy.

Before you start, make sure that you want FASTA sequences for <b>one specific gene</b> (several genes are possible, one run each) of a <b>monophyletic taxon</b>. The taxonomic level at which entries are retrieved can be set to genus or species.

Just as OG Simpson may be more than Homer S., <b>NCBI Lumberjack OG</b> outcompetes the 'old' NCBI Lumberjack 2: it can retrieve sequences for multiple outgroups.

---
---

So once again: this script enables data mining of the NCBI database for genes in monophyletic groups. Work through the blocks in order to get your result. You can mine x gene entries per genus/species of your monophylum and download a folder containing <b>all sequences</b> in a <b>fasta-file</b> together with 2 informative files:
- <b>\*stat.csv:</b> includes all taxIDs (only on level genus/species) and Accession IDs for your gene within the monophylum.
- <b>\*data.csv:</b> corresponds to your fasta file and lists key data for each sequence (accession, length, date, organism, reference, gene information, sampling locality).
<br> 
<br> 



#New to Colab?
Colab is a Google project that lets you share notebooks with interactive code (e.g. in Python). It is best to log in with a Google account and, if necessary, press the button on the upper left > Open in Playground <.

Afterwards please follow all paragraphs and execute the blocks that have square brackets in their upper left corner, which look like:

# [ <font color="white">˘</font> ]

 This can be performed by clicking on the field between the brackets. If the block is a form, please fill in the fields first and execute to confirm your input.

<b>To run the script successfully, please follow each sneaky block, read carefully and only skip fields that are declared as <font color="grey">OPTIONAL</font> and shown in <font color="grey">GREY</font> font color!</b>

#0 Usage
This script works with the help of two powerful institutions: Biopython (https://biopython.org/wiki/Documentation) and a wonderful backdoor to the NCBI databases, the Entrez E-utilities service (https://www.ncbi.nlm.nih.gov/books/NBK25497/). If you use the sequence data for scientific purposes (and not only for fun ;-) ), please cite those sources properly (Biopython in particular requires a citation of Cock et al., 2009)!

---
#1 `ncbi_miner`



First, the Biopython package (Cock et al., 2009) needs to be installed. 

**Execute the block!**

In [0]:
!pip install biopython

##1.1 The function `ncbi_miner`
The following field is the heart of the lumberjack. Nearly all functions are encoded within.

**Execute the block!**

In [0]:
#@title {display-mode: 'form'}
#@markdown The heart of Lumberjack is the function `ncbi_miner`. 
#@markdown It feeds through the NCBI jungle like the hyperactive caterpillar of a leaf miner. Do not forget to execute!
"""
Created on Tue Nov 26 09:29:22 2019

@author: Thomsn
"""

__author__ = 'thomas m. huber'
__email__ = ['thomas.huber@evobio.eu']

def ncbi_miner(taxon_query,
               gene_name,
               gene_length,
               my_mail,
               outgroups = [],
               coding_sequence = True,
               exclude_query = None,
               my_API_key = None,
               tree_resolution = 'genus',
               length_tolerance_percent = 20,
               upper_tolerance = 2,
               search_limit = 10000,
               entries_per_genus = 1,
               entries_per_tax = 1,
               taxonlist_path = None,
               random_mining = False,
               strict_search = True):
    import os
    import pandas as pd
    from Bio import SeqIO
    from Bio import Entrez
    from datetime import datetime
    from urllib.error import HTTPError
    if random_mining:
        from random import shuffle
    start_time = datetime.now()


    newpath = f'{datetime.now().strftime("%Y_%m_%d")}_{taxon_query}_{gene_name}'
    if not os.path.exists(newpath):
        os.makedirs(newpath)
        print(f'Step 0:\n >> Directory {newpath} was created.')


    gl_lowend = gene_length * (100 - length_tolerance_percent)/100
    gl_highend = gene_length * (100 + upper_tolerance * length_tolerance_percent)/100
    
    tree_resolution = str(tree_resolution).lower()

    tax_columns = ['taxID', 'genus', 'epithet', 'entry_UIDs', 'count', 'outgroup']

    if tree_resolution == 'genus':
        taxon_level = 'Genus'
    elif tree_resolution == 'species':
        taxon_level = 'species'
    else:
        print(f'Mining data with tree resolution on the taxonomic level \'{tree_resolution}\' \
is not possible with this script. Try \'genus\' (default) or \'species\'.')
        # Stop here: without a valid level, `taxon_level` would be undefined in Step 3.
        return

    if taxonlist_path:
        try:
            old_taxon_list = pd.read_csv(taxonlist_path)
        except FileNotFoundError:
            print(f'It was not possible to find input file {taxonlist_path}, please check \
the path and restart \'ncbi_miner\'.')
            return
        else:
            if all([(any(old_taxon_list.keys() == i )) for i in tax_columns]):
                print(f'The csv-file {taxonlist_path} was loaded. Step 1 will be \
skipped. Follow up steps will use those taxa to search for sequences.')

                Entrez.email = my_mail
                Entrez.api_key = my_API_key
                all_taxaIDs = list(old_taxon_list['taxID'])
        
                # At least 1, so that fewer than 20 taxa do not cause a division by zero.
                progress_criterion = max(1, len(all_taxaIDs) // 20)
                percentage_factor = 20/100
        
                print(f'Step 2:\n >> Esearch for gene entries of all species in the taxon {taxon_query}. \
This may take time, so keep the internet connection, chill down, drink a tea. \n The progress is ...')
                taxon_list = pd.DataFrame(columns = tax_columns)
                for i, taxon in enumerate(all_taxaIDs):
                    k = i + 1
                    if (k % progress_criterion) == 1:
                        print(f' - {int(i // (progress_criterion*percentage_factor))} % -')
                    else:
                        pass

                    if coding_sequence:
                        query = f'{gene_name}[Gene Name] txid{str(taxon)}'
                    else:
                        query = f'{gene_name} txid{str(taxon)}'
                    if exclude_query:
                        # Append the NOT clause, not the bare exclusion term.
                        exclude_string = f' NOT {exclude_query}'
                        query = f'{query}{exclude_string}'
                    try:
                        with Entrez.esearch(db='nuccore', term=query, retmax=search_limit, sort='date released') as handle:
                            gene_record = Entrez.read(handle)
                    except HTTPError:
                            print('New Error, but RUN Forrest RUN!')
                    else:
                            count = gene_record['Count']
                            if count == '0':
                                pass
                            else:
                                taxon_list.loc[i] = [taxon,
                                                     old_taxon_list['genus'].iloc[i],
                                                     old_taxon_list['epithet'].iloc[i],
                                                     gene_record['IdList'],
                                                     int(gene_record['Count']),
                                                     old_taxon_list['outgroup'].iloc[i]]
                taxon_list.to_csv(f'{newpath}/{taxon_query}_{gene_name}_stat.csv')
                og_list = taxon_list.iloc[:,:][taxon_list.outgroup]
                print(og_list)
                taxon_list = taxon_list.drop(og_list.index)
                print(taxon_list)
                print(f' >> Done - entry database successfully established. Summary was saved as \
{taxon_query}_{gene_name}_stat.csv')

            ### OLD ###

            else:
                print(f'The input file {taxonlist_path} does not fit with the conditions (header).\
Please change and restart \'ncbi_miner\'.')
                return

    else:
        #### > STEP 1 < ####
    
        Entrez.email = my_mail
        Entrez.api_key = my_API_key

        try:
            with Entrez.esearch(db="taxonomy", term=f'{taxon_query}[orgn]', retmax=search_limit) as handle:
                record = Entrez.read(handle)
        except HTTPError:
            return str('Database error, try later...')
        else:
            all_taxaIDs = record['IdList']
              
                #try:
                 #       with Entrez.esearch(db="taxonomy", term=f'{taxon_query}[orgn]', retmax=search_limit) as handle:
                  #          record = Entrez.read(handle)
                #except HTTPError:
                 #       return str('Database error, try later...')
                #else:
                 #       outgroup_taxaID = pd.Series([record['IdList']])

            print('Step 1:\n >> Done - taxon database successfully established.')
        
    
        #### > STEP 2 < ####
    
        Entrez.email = my_mail
        Entrez.api_key = my_API_key
        
        # At least 1, so that fewer than 20 taxa do not cause a division by zero.
        progress_criterion = max(1, len(all_taxaIDs) // 20)
        percentage_factor = 20/100
        print(f'Step 2:\n >> Esearch for gene entries of all species in the taxon {taxon_query}. \
This may take time, so keep the internet connection, chill down, drink a tea. \n The progress is ...')
        taxon_list = pd.DataFrame(columns = tax_columns)
        for i, taxon in enumerate(all_taxaIDs, 1):
            if (i % progress_criterion) == 1:
                print(f' - {int(i // (progress_criterion*percentage_factor))} % -')

            try:
                with Entrez.esummary(db="taxonomy", id=taxon, retmax=search_limit) as handle:
                    record = Entrez.read(handle)[0]
            except IndexError:
                print('New Error. Nemas Problemas!')
                break
            except HTTPError:
                print('New Error, but bro, stay seated, we skip it and keep searching...')
            else:
                if record['Rank'] == 'species' and record['Genus'] != '':
                    try:
                        if coding_sequence:
                                query = f'{gene_name}[Gene Name]+txid{str(taxon)}'
                        else:
                                query = f'{gene_name}+txid{str(taxon)}'
                        if exclude_query:
                                # Append the NOT clause, not the bare exclusion term.
                                exclude_string = f'+NOT+{exclude_query}'
                                query = f'{query}{exclude_string}'
                        with Entrez.esearch(db='nuccore', term=query, retmax=search_limit, sort='pub+date') as handle:
                            gene_record = Entrez.read(handle)
                    except HTTPError:
                        print('New Error! But nothing big to worry about.')
                    else:
                        count = gene_record['Count']
                        if count == '0':
                            pass
                        else:
                            taxon_list.loc[taxon] = [taxon,
                                                 record['Genus'],
                                                 record['Species'],
                                                 gene_record['IdList'],
                                                 int(gene_record['Count']),
                                                 'False']
                else:
                    pass
        og_list = pd.DataFrame(columns = tax_columns)
        if outgroups:  # also covers outgroups = None
            for og in outgroups:
                taxon = og
                if coding_sequence:
                    query = f'{gene_name}[Gene Name]+{og}[Organism]'
                else:
                    query = f'{gene_name}+{og}[Organism]'
                if exclude_query:
                    # Append the NOT clause, not the bare exclusion term.
                    exclude_string = f'+NOT+{exclude_query}'
                    query = f'{query}{exclude_string}'
                try:
                    with Entrez.esearch(db='nuccore', term=query, retmax=search_limit, sort='pub+date') as handle:
                          gene_record = Entrez.read(handle)
                except HTTPError:
                    print('New Error! But nothing big to worry about.')
                else:
                    count = gene_record['Count']
                    if count == '0':
                        print(f'There are no entries of the gene for outgroup {og}.')
                        pass
                    else:
                        og_list.loc[taxon] = [taxon,
                                            f'{taxon}_gen',
                                            f'{taxon}_sp',
                                            gene_record['IdList'],
                                            int(gene_record['Count']),
                                            'True']

        else:
            print('Warning: You have no outgroups!')
        # Save the ingroup summary in every case, not only when outgroups are missing.
        taxon_list.to_csv(f'{newpath}/{taxon_query}_{gene_name}_stat_ingroup.csv')
        print(f' >> Done - entry database successfully established. Summary (without outgroups) was saved as \
{taxon_query}_{gene_name}_stat_ingroup.csv')
        
    
    #### > STEP 3 < ####

    print(f'Step 3:\n >> Efetch for each species of the given genera with the most entries \
for the given gene. This may take some time as well, I would answer some mails in the meantime ;-) \n The progress is ...')
    data_list = pd.DataFrame(columns = ['taxID',
                                        'accession',
                                        'length',
                                        'date',
                                        'organism',
                                        'reference',
                                        'gene_information',
                                        'sampling_locality',
                                        'outgroup',
                                        'genus',
                                        'epithet'])
    ### NEW ###
    full_taxon_list = pd.concat([taxon_list, og_list])
    if taxon_level == 'Genus':
        genera = full_taxon_list['genus'].unique()
    else:
        genera = full_taxon_list['taxID'].unique()
    ### OLD ###

    progress_criterion = -( -len(genera) // 10)
    percentage_factor = 10/100
    
    
    for j, genus_ID in enumerate(genera):
        if (j % progress_criterion) == 1:
            print(f' - {int(j // (progress_criterion*percentage_factor))} % -')
        else:
            pass
        if taxon_level == 'Genus':
          all_species = full_taxon_list[full_taxon_list['genus'] == genus_ID]
        else:
          all_species = full_taxon_list[full_taxon_list['taxID'] == genus_ID]
        all_species = all_species.sort_values(by=['count'], ascending=False)
        if len(list(all_species['entry_UIDs'])) < entries_per_genus:
            epg = len(list(all_species['entry_UIDs']))
        else:
            epg = entries_per_genus
        for entry in list(range(epg)):
        ### NEW ###
            entry_list = list(all_species['entry_UIDs'])[entry]
            if random_mining:
                shuffle(entry_list)
        ### OLD ###
            Entrez.email = my_mail
            Entrez.api_key = my_API_key
            entry_counter = 0
            for i, acc_id in enumerate(entry_list):
                try:
                    with Entrez.efetch(db="nuccore", id=acc_id, retmode='text', rettype="gb") as handle:
                        record = SeqIO.read(handle, "genbank")
                except HTTPError:
                    print('New Error, but chill down, everything is soft')
                else:
                    if gl_lowend < len(record) < gl_highend:
                        features = record.features
                        filterframe = [x.type == 'gene' for x in features]
                        if any(filterframe) == True:
                            keyword = 'gene'
                        else:
                            filterframe = [x.type != 'source' for x in features]
                            keyword = 'product'
                        gene_features = [x for i, x in enumerate(features) if filterframe[i]==True]
                        try:
                            gene_info = [(x.qualifiers.get(keyword))[0] for x in gene_features]
                        except TypeError:
                            gene_info = ['']
                        if strict_search:
                            if (len(gene_info) == 1 and gene_name.lower() in gene_info[0].lower()):
                                sample_locality = features[0].qualifiers.get('country')
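                                try:
                                    referenza = record.annotations['references'][0]
                                except (KeyError, IndexError, TypeError):
                                    referenza = ['']
                                # Row of the species being processed; selecting by taxID == genus_ID
                                # finds nothing when genus_ID is a genus name.
                                selection = all_species.iloc[[entry]]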
                                data_list.loc[f'{j+1}_{record.id}'] = [genus_ID,
                                                    record.id,
                                                    len(record),
                                                    record.annotations['date'],
                                                    record.annotations['organism'],
                                                    referenza,
                                                    gene_info,
                                                    sample_locality,
                                                    selection['outgroup'].iat[0],
                                                    selection['genus'].iat[0],
                                                    selection['epithet'].iat[0]]
                                with open(f'{newpath}/{taxon_query}_{gene_name}.fasta', 'a') as finalfasta:
                                    rawseq = str(record.seq)
                                    if taxon_level == 'species':
                                        fasta_head = str(record.annotations['organism'])
                                        fasta_head = fasta_head.replace(' ', '_').replace('.', '').replace('-', '_')
                                    else:
                                        fasta_head = genus_ID
                                    if entries_per_tax == 1 and entries_per_genus == 1:
                                        finalfasta.write(f'>{fasta_head}\n')
                                    else:
                                        finalfasta.write(f'>{fasta_head}_{record.id}\n')
                                    finalfasta.write(f'{rawseq}\n\n')
                        ### NEW ###
                                entry_counter += 1
                                if entry_counter == entries_per_tax:
                                    break
                                else:
                                    pass
                            else: 
                                pass
                        else:
                            sample_locality = features[0].qualifiers.get('country')
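                            try:
                                referenza = record.annotations['references'][0]
                            except (KeyError, IndexError, TypeError):
                                referenza = ['']
                            # Row of the species being processed; selecting by taxID == genus_ID
                            # finds nothing when genus_ID is a genus name.
                            selection = all_species.iloc[[entry]]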
                            data_list.loc[f'{j+1}_{record.id}'] = [genus_ID,
                                                record.id,
                                                len(record),
                                                record.annotations['date'],
                                                record.annotations['organism'],
                                                referenza,
                                                gene_info,
                                                sample_locality,
                                                selection['outgroup'].iat[0],
                                                selection['genus'].iat[0],
                                                selection['epithet'].iat[0]]
                            with open(f'{newpath}/{taxon_query}_{gene_name}.fasta', 'a') as finalfasta:
                                rawseq = str(record.seq)
                                if taxon_level == 'species':
                                    fasta_head = str(record.annotations['organism'])
                                    fasta_head = fasta_head.replace(' ', '_').replace('.', '').replace('-', '_')
                                else:
                                    fasta_head = genus_ID
                                if entries_per_tax == 1 and entries_per_genus == 1:
                                    finalfasta.write(f'>{fasta_head}\n')
                                else:
                                    finalfasta.write(f'>{fasta_head}_{record.id}\n')
                                finalfasta.write(f'{rawseq}\n\n')
                    ### NEW ###
                            entry_counter += 1
                            if entry_counter == entries_per_tax:
                                break
                            else:
                                pass
                ### OLD ###
                    else:
                        pass
    print('Updating the outgroup data...')
    out_selection = data_list[data_list['outgroup'] == 'True']
    for i, org in enumerate(out_selection['organism']):
        try:
            with Entrez.esearch(db="taxonomy", term=f'{org}[Scientific Name]', retmax=search_limit) as handle:
                record = Entrez.read(handle)
        except HTTPError:
            print(f'Error, did not find {org}.')
        else:
            txid = record['IdList'][0]
            try:
                with Entrez.esummary(db="taxonomy", id=txid, retmax=search_limit) as handle:
                    spef_record = Entrez.read(handle)[0]
            except IndexError:
                print('New Error. Nemas Problemas!')
                break
            except HTTPError:
                print('Errore furore, no problemo spaghetto!')
            else:
                og_gen = spef_record['Genus']
                og_sp = spef_record['Species']
                data_list.loc[data_list['organism'] == org, ['taxID']] = txid
                data_list.loc[data_list['organism'] == org, ['genus']] = og_gen
                data_list.loc[data_list['organism'] == org, ['epithet']] = og_sp
                taxon_list.loc[txid] = [txid,
                                  og_gen,
                                  og_sp,
                                  og_list['entry_UIDs'][i],
                                  og_list['count'][i],
                                  'True']


    data_list.to_csv(f'{newpath}/{taxon_query}_{gene_name}_data.csv')
    taxon_list.to_csv(f'{newpath}/{taxon_query}_{gene_name}_stat.csv')
    print(f' >> Done - most recent fasta sequences were collected and successfully concatenated. \
They were saved in the file {taxon_query}_{gene_name}.fasta, which is ready for nexus conversion. \
Summary of used sequences was saved as {taxon_query}_{gene_name}_data.csv. In addition the \
outgroup was added to the stat-file and saved as: {taxon_query}_{gene_name}_stat.csv.')
    stop_time = datetime.now()
    process_time = stop_time - start_time
    print(f' >> NCBI mining finished. It took {process_time.seconds // 60} min, \
{process_time.seconds % 60} sec. The files are stored in the directory {newpath}.')



---
#2 Parameters
NCBI Lumberjack uses 4 mandatory parameters <b>(see 2.1)</b> and several optional parameters <b>(see 2.2)</b>. You need to execute all blocks of both paragraphs!

<font color="grey"><b>OPTIONAL:</b> If you have already performed an NCBI Deforestation with the Lumberjack, you can upload a \*data.csv or \*stat.csv file for your analysis in <b>paragraph 2.3</b>.</font>

##2.1 Mandatory arguments for search
Please enter your monophylum as query and run the block.

In [0]:
#{display-mode: 'form'}
#@markdown Name of monophyletic taxon:
taxon_query = 'Bufonidae' #@param {type:"string"}
#@markdown Name of gene:
gene_name = 'coi' #@param {type:"string"}
#@markdown Estimate of gene length (check at NCBI, pubmed, ...):
gene_length = 600 #@param {type:"number"}
#@markdown Enter your e-mail address (mandatory for NCBI search):
my_mail = 'antilope.booty@evo.com' #@param {type:"string"}
#@markdown <br>**Execute the block!**

##2.2 Optional arguments for search
The following parameters can be adjusted. Please run the block afterwards.

###2.2.1 Do you want an outgroup?
Please type taxon names (lowest level: genus) in quotation marks, within square brackets; e.g. ['Soldanella', 'Primula', 'Gentiana'].
If not used, leave empty square brackets ([]) or type just `None`.
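
The form accepts a list of names, empty brackets or `None`. A minimal sketch of how these inputs could be brought into one shape before the search starts; `normalize_outgroups` is a hypothetical helper for illustration and not part of `ncbi_miner`.

```python
def normalize_outgroups(outgroups):
    """Return the outgroup names as a clean list of strings."""
    if outgroups is None:
        return []
    if isinstance(outgroups, str):
        # A single name typed without brackets is wrapped into a list.
        outgroups = [outgroups]
    # Drop empty entries and surrounding whitespace.
    return [name.strip() for name in outgroups if name and name.strip()]


print(normalize_outgroups(None))
print(normalize_outgroups(['Soldanella', ' Primula ', '']))
```

With such a helper, the three spellings "no outgroup" can take (`None`, `[]`, `['']`) all end up as an empty list.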


In [0]:
#{display-mode: 'form'}
outgroups = ['Rana'] #@param {type:"raw"}
#@markdown **Execute the block!**

###2.2.2 Is your gene_query for a protein coding sequence (i.e. gene)?
This is a very important question. For instance, 'COI' encodes the first subunit of cytochrome oxidase. RNA-coding sequences like '28S' are <b>not protein-coding sequences</b> here, so please untick the box for queries like this. Otherwise the fasta-file will be empty :-P
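
The tick box changes the search term that is sent to the nucleotide database. The sketch below mirrors the way `ncbi_miner` assembles the term in its Step 2; `build_query` is a hypothetical name, and the taxon ID `12345` and the exclusion word are made-up placeholders.

```python
def build_query(gene_name, taxon_id, coding_sequence=True, exclude_query=None):
    """Assemble the nucleotide search term for one taxon."""
    if coding_sequence:
        # Restrict the gene name to the 'Gene Name' field.
        query = f'{gene_name}[Gene Name] txid{taxon_id}'
    else:
        # Free-text search, e.g. for rRNA regions such as 28S.
        query = f'{gene_name} txid{taxon_id}'
    if exclude_query:
        query = f'{query} NOT {exclude_query}'
    return query


print(build_query('coi', 12345))
print(build_query('28S', 12345, coding_sequence=False))
```

With the box ticked, only entries that carry the name in their gene field can match, which is why a query like '28S' would probably find nothing.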


In [0]:
#{display-mode: 'form'}
coding_sequence = True #@param {type:"boolean"}
#@markdown **Execute the block!**

###2.2.3 Do you want to search faster?

Enter your NCBI API key (increases search pace by factor ~ 3). Read more here : https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

(!) You will need to set it within quotation marks (e.g. 'AZNE192930N8D...NDJE9D0').


In [0]:
#{display-mode: 'form'}
my_API_key = None #@param {type:"raw"}
#@markdown **Execute the block!**<br>


###2.2.4 Do you want entries per genus or per species of your monophylum?
Which resolution should your tree have (default: genus)? i.e. taxonomic level of external branches.
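
To illustrate what the resolution changes: in `ncbi_miner` the taxa are grouped by the `genus` column for 'genus' and by the `taxID` column for 'species'. A small sketch with plain dictionaries; `group_taxa` is a hypothetical helper and the taxIDs are made-up placeholders.

```python
from collections import defaultdict


def group_taxa(rows, tree_resolution='genus'):
    """Group species rows into the units that become tips of the tree."""
    level = str(tree_resolution).lower()
    if level not in ('genus', 'species'):
        raise ValueError("tree_resolution must be 'genus' or 'species'")
    key = 'genus' if level == 'genus' else 'taxID'
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row['epithet'])
    return dict(groups)


# Placeholder taxIDs, for illustration only.
rows = [
    {'taxID': '101', 'genus': 'Bufo', 'epithet': 'bufo'},
    {'taxID': '102', 'genus': 'Bufo', 'epithet': 'gargarizans'},
    {'taxID': '103', 'genus': 'Rhinella', 'epithet': 'marina'},
]
print(group_taxa(rows, 'genus'))
print(group_taxa(rows, 'species'))
```

At genus level the two *Bufo* species share one group, at species level every taxID forms its own group.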

In [0]:
#{display-mode: 'form'}
tree_resolution = 'genus' #@param ['genus', 'species']
#@markdown **Execute the block!**


###2.2.5 Do you want to retrieve only verified entries for your gene?

In [0]:
#{display-mode: 'form'}
#@markdown This is the most important filter and recommended, especially for large monophyla! Deactivate it if your \*data.csv file contains far fewer taxa than the \*stat.csv file.
strict_search = True #@param {type:"boolean"}
#@markdown **Execute the block!**
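
A sketch of the test behind `strict_search`, as it appears in `ncbi_miner`: an entry passes only if it carries exactly one gene annotation and that annotation contains the gene name. `passes_strict_search` is a hypothetical name for illustration.

```python
def passes_strict_search(gene_name, gene_info):
    """True if the entry has exactly one gene annotation that contains gene_name."""
    return len(gene_info) == 1 and gene_name.lower() in gene_info[0].lower()


print(passes_strict_search('coi', ['COI']))
print(passes_strict_search('coi', ['COI', 'tRNA-Ser']))
print(passes_strict_search('coi', ['COII']))
```

Note that the comparison is a case-insensitive substring match, so an annotation like 'COII' also passes for the query 'coi'.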


###2.2.6 How many entries do you want to retrieve?

In [0]:
#{display-mode: 'form'}
#@markdown How many species per genus do you want to retrieve at most? The species within a genus are ranked by their number of entries for the given gene. This only works if `tree_resolution` is genus.
entries_per_genus = 1 #@param {type:"slider", min:1, max:20, step:1}
#@markdown How many entries do you want to retrieve per species/taxon at most?
entries_per_tax = 1 #@param {type:"slider", min:1, max:20, step:1}
#@markdown **Execute the block!**


### 2.2.7 Other parameters

In [0]:
#{display-mode: 'form'}
#@markdown Usually the results will be sorted by increasing age, so the most recent entries should be retrieved. Do you want to change it to random order?
random_mining = False #@param {type:"boolean"}
#@markdown Filter: tolerance (in percent) around the estimated gene length; it sets the lower length limit for the retrieved sequences (default here: 50 % tolerance). This filter is weak and not needed if you use `strict_search`. But you can play with it.
length_tolerance_percent = 50 #@param {type:"slider", min:0, max:100, step:1}
#@markdown Filter adjustment: factor by which `length_tolerance_percent` is multiplied to set the upper length limit.
upper_tolerance = 2 #@param {type:"number"}
#@markdown Enter the limit of retrieved results per search (default: 10000), super unimportant - do not change.
search_limit = 10000 #@param {type:"slider", min:1, max:10000, step:1}

taxonlist_path = None
#@markdown **Execute the block!**
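
How the two tolerance parameters turn into a length window can be sketched as follows. The arithmetic is the one used at the start of `ncbi_miner`; the helper names `length_window` and `length_ok` are hypothetical.

```python
def length_window(gene_length, length_tolerance_percent=20, upper_tolerance=2):
    """Return the (lower, upper) sequence length limits."""
    low = gene_length * (100 - length_tolerance_percent) / 100
    high = gene_length * (100 + upper_tolerance * length_tolerance_percent) / 100
    return low, high


def length_ok(seq_length, window):
    """A sequence passes only if it lies strictly inside the window."""
    low, high = window
    return low < seq_length < high


window = length_window(600, length_tolerance_percent=50, upper_tolerance=2)
print(window)  # (300.0, 1200.0)
print(length_ok(650, window))
```

With the form values above (600 bp, 50 %, factor 2), sequences longer than 300 bp and shorter than 1200 bp are kept.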


Check if you have executed all blocks properly after entering the input. 

If you do not want to upload a csv file as query, you should jump now to paragraph 3 and start your datamining. Otherwise continue with 2.3 (**OPTIONAL**).

##<font color="grey">2.3 Upload of accession list as csv</font>
<font color="grey"><b>OPTIONAL:</b> Do you already have a set of selected accession IDs as a result of a previous use of lumberjack (i.e. deforestation) and you want to use it for a new search (e.g. different genes)? Then you should upload the \*stat.csv (= unfiltered) or \*data.csv (= filtered) file here:</font>
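
Before a file is used, `ncbi_miner` checks that its header contains the six columns of the stat table. The sketch below reproduces that check with the standard `csv` module so you can test a file beforehand; `missing_columns` is a hypothetical helper and the header strings are typed by hand for illustration.

```python
import csv
import io

# Columns that ncbi_miner expects in an uploaded taxon list.
REQUIRED_COLUMNS = ['taxID', 'genus', 'epithet', 'entry_UIDs', 'count', 'outgroup']


def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [col for col in REQUIRED_COLUMNS if col not in header]


stat_header = ',taxID,genus,epithet,entry_UIDs,count,outgroup\n'
data_header = (',taxID,accession,length,date,organism,reference,'
               'gene_information,sampling_locality,outgroup,genus,epithet\n')
print(missing_columns(stat_header))
print(missing_columns(data_header))
```

Judging from the columns written by `ncbi_miner`, a \*data.csv header lacks `entry_UIDs` and `count`, so the header check as written would probably reject it; a \*stat.csv file is the safer choice.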

In [0]:
#@title <font color="grey">--> Upload here! <-- </font>{display-mode: 'form'}
#@markdown <font color="grey">First run the block and then select the file you want to use. It will be uploaded. If you later want to perform a new, neutral analysis without this file as query, you need to rerun the paragraphs 2.1 and 2.2. </font>

from google.colab import files
uploaded = files.upload() 

for fn in uploaded.keys():
  taxonlist_path = fn
  print(f'Upload successful, {fn} will be used in the following ncbi_miner session')

---
#3 Let's go!
It is getting serious. The following block is the query command: it will start your data mining. Check once more that you have the right input arguments and start it. This will take a lot of time, sometimes more than one hour for large monophyla. It is important to keep the connection to the web, but luckily the analysis runs on a fancy cloud outside your window, so your computer will not be bothered. The data mining runs in 3 steps and will talk to you:


1.   Getting a taxon database within the monophylum
2.   Getting a gene sequences database for your taxa
3.   Collecting filtered sequences for your taxa

If you want to read more about it, check out the paragraph below the execution block.

Warning: Sometimes there are problems in the connection, check your input, wait some time and it will be fine...

**And now: execute the block!**

<font color="grey">P.S. The sometimes weird percentages during the analysis are coded like this on purpose, to keep you entertained... ;-)</font>


In [0]:
#@markdown On the left you can see how the function is called. Run it, to conduct the deforestation of the NCBI jungle...
ncbi_miner(taxon_query = taxon_query,
               gene_name = gene_name,
               gene_length = gene_length,
               my_mail = my_mail,
               outgroups = outgroups,
               coding_sequence = coding_sequence,
               my_API_key = my_API_key, 
               tree_resolution = tree_resolution,
               length_tolerance_percent = length_tolerance_percent,
               upper_tolerance = upper_tolerance,
               search_limit = search_limit,
               entries_per_genus = entries_per_genus,
               entries_per_tax = entries_per_tax,
               taxonlist_path = taxonlist_path,
               random_mining = random_mining,
               strict_search = strict_search)

After the Lumberjack has finished, download your search results or perform a ReSearch (optional)...

###Background: mechanics of the Lumberjack OG:
In Step 1 your `taxon_query` (+ outgroup, but for the latter it is more complicated) is searched against the NCBI taxonomy database (esearch) and all taxa within the monophylum will be collected. 
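
Step 1 on its own can be sketched like this. The Entrez calls are the ones used inside `ncbi_miner`; the function names `taxonomy_term` and `fetch_taxon_ids` are hypothetical, and the second function needs Biopython and an internet connection.

```python
def taxonomy_term(taxon_query):
    """Search term that matches every taxon within the monophylum."""
    return f'{taxon_query}[orgn]'


def fetch_taxon_ids(taxon_query, my_mail, search_limit=10000):
    """Ask the NCBI taxonomy database for all taxIDs below taxon_query."""
    # Imported here so that taxonomy_term() also works without Biopython.
    from Bio import Entrez
    Entrez.email = my_mail
    with Entrez.esearch(db='taxonomy',
                        term=taxonomy_term(taxon_query),
                        retmax=search_limit) as handle:
        record = Entrez.read(handle)
    return list(record['IdList'])


print(taxonomy_term('Bufonidae'))
# Network call, uncomment to try it with your own address:
# ids = fetch_taxon_ids('Bufonidae', 'your.name@example.org')
```

The returned list holds taxIDs of all ranks; the filtering for species happens afterwards in Step 2.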

Afterwards Step 2 begins with single NCBI taxonomy database requests (esummary) and all taxa are filtered for species or genus level (parameter: `tree_resolution`). This wonderful preselection will be run for the `gene_name` against the NCBI nucleotide database (i.e. genbank; esearch) and all entries are collected (sorted from latest to oldest or random; parameter: `random_mining`) and all together is saved as the \*stat.csv file. 

In the following Step 3, a maximum specified number of 

*   species per genus (parameter: `entries_per_genus`; works only if `tree_resolution` is genus)
*   entries per species (parameter: `entries_per_tax`)

will be retrieved from the NCBI nucleotide database (efetch). During retrieval, some filters (length, ...) are applied.
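
To make the idea of Step 3 concrete, here is a minimal sketch of a length filter combined with a cap on entries per taxon. It assumes a symmetric tolerance window around `gene_length`; the notebook's separate `upper_tolerance` parameter is not modelled, and the function names, dictionary keys and accessions are placeholders invented for this example.

```python
def within_tolerance(seq_length, gene_length, length_tolerance_percent):
    """True if seq_length lies within +/- tolerance percent of gene_length."""
    margin = gene_length * length_tolerance_percent / 100
    return gene_length - margin <= seq_length <= gene_length + margin

def pick_entries(entries, gene_length, length_tolerance_percent, entries_per_tax):
    """Keep at most entries_per_tax accessions per taxon that pass the length filter."""
    kept = {}
    for entry in entries:
        chosen = kept.get(entry["taxon"], [])
        if len(chosen) >= entries_per_tax:
            continue  # this taxon already has enough entries
        if within_tolerance(entry["length"], gene_length, length_tolerance_percent):
            kept[entry["taxon"]] = chosen + [entry["accession"]]
    return kept

# Placeholder entries, invented for this example
entries = [
    {"taxon": "sp_a", "accession": "ACC1", "length": 790},
    {"taxon": "sp_a", "accession": "ACC2", "length": 300},  # too short
    {"taxon": "sp_a", "accession": "ACC3", "length": 810},
    {"taxon": "sp_a", "accession": "ACC4", "length": 800},  # cap already reached
    {"taxon": "sp_b", "accession": "ACC5", "length": 900},  # too long
]

picked = pick_entries(entries, gene_length=800,
                      length_tolerance_percent=10, entries_per_tax=2)
print(picked)
```

Because the entries are processed in the order given, sorting them from latest to oldest (or shuffling them) beforehand decides which ones fill the cap.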

At the end, the \*data.csv file is produced, all sequences are concatenated into a single fasta file, and all 3 files are made available for download as a ZIP folder (see 5.).
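
The final step can be pictured with the following sketch, which writes a concatenated fasta file and a matching table. The column names, the file prefix and the records are assumptions made for illustration; the actual \*data.csv produced by the Lumberjack may contain different fields.

```python
import csv
import tempfile
from pathlib import Path

def write_outputs(records, out_dir, prefix):
    """Write all sequences to one fasta file and their metadata to a csv table."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fasta_path = out_dir / f"{prefix}.fasta"
    data_path = out_dir / f"{prefix}_data.csv"
    with open(fasta_path, "w") as fasta, open(data_path, "w", newline="") as table:
        writer = csv.writer(table)
        writer.writerow(["accession", "taxon", "length"])  # hypothetical columns
        for rec in records:
            # One fasta entry per record: header line, then the sequence
            fasta.write(f">{rec['accession']} {rec['taxon']}\n{rec['sequence']}\n")
            writer.writerow([rec["accession"], rec["taxon"], len(rec["sequence"])])
    return fasta_path, data_path

# Placeholder records, invented for this example
records = [
    {"accession": "ACC1", "taxon": "Examplea alpha", "sequence": "ACGTACGT"},
    {"accession": "ACC2", "taxon": "Examplea beta", "sequence": "ACGTTT"},
]

with tempfile.TemporaryDirectory() as tmp:
    fasta_path, data_path = write_outputs(records, tmp, "Examplea_gene")
    fasta_text = fasta_path.read_text()
    with data_path.open(newline="") as handle:
        data_rows = list(csv.reader(handle))

print(fasta_text)
```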


---
#<font color="grey">4 ReSearch with different gene but same accessions</font>
<font color="grey"><b>OPTIONAL:</b> This block reuses the data that your previous search produced on the server to run a new search for a different gene. The new search covers all taxa that had an entry in the preceding (first!) search for the taxon you queried last.

Note: If you want to upload a file from your computer instead, change `gene_name` and `gene_length` in 2.1, upload the file in 2.3, run all blocks in paragraphs 2.1 - 2.3 and finally run `ncbi_miner` in 3.</font>

##<font color="grey">4.1 Mandatory arguments for ReSearch</font>
<font color="grey">Please enter the name of your new gene, its estimated length and the date of your preceding search, then run the next two blocks.</font>

In [0]:
#{display-mode: 'form'}
#@markdown <font color="grey">What is the name of your new gene?</font>
new_gene_name = '28s' #@param {type:"string"}
#@markdown <font color="grey">Estimate the length of your new gene:</font>
new_gene_length = 800 #@param {type:"number"}
#@markdown <font color="grey">When did you perform the search for accessions for the old gene?</font>
date_of_old_search = "2020-01-06"#@param {type:"date"}
date_underscore = date_of_old_search.replace('-', '_')
coding_sequence = False #@param {type:"boolean"}

# Path to the *stat.csv file written by the preceding search
old_file = f'{date_underscore}_{taxon_query}_{gene_name}/{taxon_query}_{gene_name}_stat.csv'

##<font color="grey">4.2 Run the ReSearch</font>
<font color="grey">Check that you set the parameters in 4.1 correctly, then execute the next block.

Note: It will not overwrite the files produced by the preceding search!</font>

In [0]:
#{display-mode: 'form'}
#@markdown <font color="grey"> Run the search with new gene - old accessions (same parameters).</font>
ncbi_miner(taxon_query = taxon_query,
               gene_name = new_gene_name,
               gene_length = new_gene_length,
               my_mail = my_mail,
               coding_sequence = coding_sequence,
               my_API_key = my_API_key,
               tree_resolution = tree_resolution,
               length_tolerance_percent = length_tolerance_percent,
               upper_tolerance = upper_tolerance,
               search_limit = search_limit,
               entries_per_genus = entries_per_genus,
               entries_per_tax = entries_per_tax,
               taxonlist_path = old_file,
               random_mining = random_mining)



---
#5 Download
After producing your data, you can download it as a zip file containing the fasta file (sequences) and the corresponding data (\*data.csv), as well as information about the gene entries of your monophylum in genbank (\*stat.csv).

In [0]:
#@title 5.1 Zip file for (Re)Search\* {display-mode: 'form'}
from datetime import datetime
import shutil
from google.colab import files
#@markdown When did you perform the (re)search you want to download?
date_of_search = "2020-01-06"#@param {type:"date"}
date_uscore = date_of_search.replace('-', '_')
#@markdown Did you perform a search (paragraph 3) or a ReSearch (paragraph 4)?
option = 'search' #@param ['search', 'research']

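# Pick the gene name that belongs to the chosen run
if option == 'search':
  zip_gene = gene_name
else:
  zip_gene = new_gene_name
# Name of the output folder: <date>_<taxon>_<gene>
directory = f'{date_uscore}_{taxon_query}_{zip_gene}'
# Pack the folder into <directory>.zip and send it to the browser
folder = shutil.make_archive(directory, 'zip', directory)
files.download(folder)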


#@markdown <br><br>The data you produced will be stored in a ZIP file and downloaded. <br> <br> \*) Search and ReSearch store their files in different folders, so these have to be downloaded one after the other.
#@markdown <br><br><b>Execute this block for download!</b>



---
---
#References
<i>Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009):</i> Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423

---
<font color="grey">Thank you for using the NCBI lumberjack, have fun with the sequences!<br>
`Questions: thomas.huber{ett}evobio.eu`</font>