# GenBank phylogeny from COI sequences

Complete mitogenome GenBank records were downloaded for four ant species, each from a different subfamily:
  - ***Pseudomyrmex gracilis*** (Pseudomyrmecinae)
  - ***Formica fusca*** (Formicinae)
  - ***Linepithema humile*** (Dolichoderinae)
  - ***Solenopsis invicta*** (Myrmicinae)

In [1]:
# importing everything we'll need
from Bio import SeqIO
import os, glob
import pandas as pd
import skbio.io

Now we use SeqIO to extract the COI nucleotide sequence of each species and save each one to a separate file:

In [41]:
def create_dir(dir_name):
    os.makedirs(os.path.dirname(dir_name), exist_ok=True)

def extract_COI(gb_file):
    for record in SeqIO.parse(gb_file, "genbank"):
        species_name = record.annotations.get('organism').replace(" ", "_")
        filename = "./coi_seqs/{}_coi.fa".format(species_name)
        create_dir(filename)
        with open(filename, "w") as coi_file:
            for gene in record.features:
                # Default to [''] so that CDS features without a 'gene' qualifier are skipped instead of raising a TypeError
                if gene.type == "CDS" and gene.qualifiers.get('gene', [''])[0] in ['COX1', 'COI']:
                    header = "{}-{}".format(species_name, gene.qualifiers.get('gene')[0])
                    sequence = gene.location.extract(record.seq) # But the sequence may have a truncated stop codon
                    if len(sequence) % 3 == 1:
                        sequence += "AA" # Remainder 1: need to append 'AA'
                    elif len(sequence) % 3 == 2:
                        sequence += "A" # Remainder 2: need to append 'A'
                    coi_file.write(">{}\n{}\n".format(header, sequence))

for gb_file in glob.glob("./ant_mitogenomes/*.gb"):
    extract_COI(gb_file)
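
The padding step in `extract_COI` deals with a known quirk of animal mitochondrial genes: the stop codon is often incomplete in the genome (a bare `T` or `TA`) and is only completed to `TAA` when the transcript is polyadenylated. Below is a minimal sketch of that step in isolation, using plain strings. The helper name `pad_truncated_stop` and the toy sequences are assumptions for illustration and are not part of the original notebook.

```python
def pad_truncated_stop(sequence):
    """Append the 'A's needed to make the length a multiple of three.

    Hypothetical helper mirroring the padding logic in extract_COI.
    """
    remainder = len(sequence) % 3
    if remainder == 1:
        return sequence + "AA"  # e.g. a trailing 'T' becomes 'TAA'
    if remainder == 2:
        return sequence + "A"   # e.g. a trailing 'TA' becomes 'TAA'
    return sequence             # already a whole number of codons


# Made-up toy sequences, for illustration only
print(pad_truncated_stop("ATGAAAT"))    # 7 nt, padded with 'AA'
print(pad_truncated_stop("ATGAAATA"))   # 8 nt, padded with 'A'
print(pad_truncated_stop("ATGAAATAA"))  # 9 nt, left unchanged
```

All three calls return `ATGAAATAA`. Note that the padding only restores the reading frame; it does not check that the last codon really is a stop codon.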

Running blastn with the COI sequences against NCBI's Formicidae sequences:

**NOTE:** This step requires the [taxdb database](ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz).

In [27]:
%%bash

# Extracting taxdb database (if necessary)
taxdb=$(pwd)/taxdb
if [[ ! -d "$taxdb" ]]; then
    echo "taxdb dir not found. Extracting taxdb database..."
    mkdir "$taxdb"
    tar -C "$taxdb" -xaf taxdb.tar.gz
else
    echo "taxdb dir found"
fi

# Setting BLASTDB variable
echo "Setting BLASTDB variable to $taxdb"
export BLASTDB=$taxdb

# Creating directory for blast results (if necessary)
if [[ ! -d "blast_results" ]]; then 
    echo "Creating directory for blast results"
    mkdir blast_results
else
    echo "blast_results directory already created"
fi

# Performing blast searches
# (sseq, the aligned subject sequence, is requested because the dataframe cells below read the 'sseq' column)
for coi in ./coi_seqs/*; do
    echo "Running blast search for $coi..." &&
    #blastn -query $coi -db ./blast_teste/ant_mito -out ./blast_results/$(basename $coi .fa).blast -outfmt "6 qseqid sseqid staxids sscinames stitle sacc saccver slen sstart send qseq";
    blastn -query "$coi" -db nr -max_target_seqs 100 -remote -entrez_query "Formicidae [Organism]" -outfmt "7 qseqid sseqid staxids sscinames stitle sacc saccver slen sstart send sseq" -out "./blast_results/$(basename "$coi" .fa).blast"; # set max_target_seqs to a higher value when running this for real
done 

taxdb dir found
Setting BLASTDB variable to /home/gabriel/Dropbox/repos/genbank_phylogeny/taxdb
blast_results directory already created
Running blast search for ./coi_seqs/Formica_fusca_coi.fa...
Running blast search for ./coi_seqs/Linepithema_humile_coi.fa...
Running blast search for ./coi_seqs/Pseudomyrmex_gracilis_coi.fa...
Running blast search for ./coi_seqs/Solenopsis_invicta_coi.fa...


**NOTE:** Blast with the `-remote` flag can take a long time to run and is not compatible with `-taxidlist`. If disk space is not a problem, it may be better to download the entire nt database and run the search locally.

Saving the blast results into dataframes:

In [35]:
def create_dataframe(blast_result):
    with open(blast_result) as blast7:
        df = skbio.io.read(blast7, format='blast+7', into=pd.DataFrame)
        return df

def extract_columns_dataframe(df):
    df = df[['qseqid', 'staxids', 'sscinames', 'sacc', 'sseq']]
    df['sseq'] = df['sseq'].apply(lambda x: x.replace('-', ''))
    #df['sseqlen'] = len(df['sseq'])
    df['sseqlen'] = df.apply(lambda row: len(row.sseq), axis = 1) 
    df = df[['qseqid', 'staxids', 'sscinames', 'sacc', 'sseqlen', 'sseq']]
    return df
    
#df = extract_columns_dataframe(create_dataframe('blast_results/Formica_fusca_coi_old.blast'))
#print(df)
#df.to_excel("blast.xlsx")
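
For reference, `-outfmt 7` output is tab-separated text in which comment lines start with `#`, so it can also be parsed with pandas alone. The sketch below is an alternative to `skbio.io.read`, not the method used in this notebook. It assumes the column names are supplied in the same order as in the `-outfmt` string; the helper name `read_blast7` and the sample rows are made up for illustration.

```python
import io
import pandas as pd


def read_blast7(handle, columns):
    """Parse BLAST tabular output with comment lines (-outfmt 7).

    Hypothetical helper: `columns` must list the fields in the same
    order as the -outfmt string used for the search.
    """
    rows = [
        line.rstrip("\n").split("\t")
        for line in handle
        if line.strip() and not line.startswith("#")  # skip blank and comment lines
    ]
    return pd.DataFrame(rows, columns=columns)


# Made-up example (not real BLAST output)
example = io.StringIO(
    "# BLASTN\n"
    "# Query: toy_query\n"
    "toy_query\t12345\tGenus species\tAB000001\tATG-CC\n"
    "toy_query\t12345;67890\tGenus hybrid\tAB000002\tATGACC\n"
    "# BLAST processed 1 queries\n"
)
df = read_blast7(example, ["qseqid", "staxids", "sscinames", "sacc", "sseq"])
print(df.shape)
```

Every column comes back as a string with this approach, so numeric fields such as `slen` would need an explicit `pd.to_numeric` conversion before use.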

In [13]:
#Testing methods of dataframe
#df[df["staxids"].str.contains(";")].index
#df.index
#df[";" in df.staxids].index

Int64Index([8, 306, 314, 499], dtype='int64')

Now that we have the blast results in a dataframe, we can clean it in order to:

-  Remove rows with more than one taxid;
-  Sort the dataframe (descending) by both taxid and sseqlen;
-  Keep only one record per taxid (the one with the longest sseqlen).

In [53]:
def clean_dataframe(df):
    clean_df = df.drop(df[df.staxids.str.contains(";")].index) # Removing rows with hybrid sequences (more than one taxid value)
    #clean_df["staxids"] = pd.to_numeric(clean_df["staxids"])
    clean_df = clean_df.sort_values(by=["staxids", "sseqlen"], ascending=False) # Sorting dataframe by taxid and sseqlen (descending) - Guarantees that highest sseqlen will always be the first row for that taxid
    # Printing all rows to check output
    #with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
    #    print(clean_df[["sscinames", "sacc", "staxids", "sseqlen"]])
    clean_df = clean_df.drop_duplicates(subset="staxids", keep='first') # Keeps only one record per taxid: the one with the highest sseqlen
    return clean_df

#print(clean_dataframe(df).dtypes)
#
#with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
#    print(clean_dataframe(df)[["sscinames", "sacc", "staxids", "sseqlen"]])
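
To see what the three cleaning steps do, here is a self-contained sketch on a made-up table (the taxids and accessions are invented). It repeats `clean_dataframe` so that it runs on its own, and it assumes `staxids` holds strings, which `.str.contains` requires.

```python
import pandas as pd


def clean_dataframe(df):
    # Same logic as the cell above, repeated so this sketch runs on its own
    clean_df = df.drop(df[df.staxids.str.contains(";")].index)  # drop multi-taxid rows
    clean_df = clean_df.sort_values(by=["staxids", "sseqlen"], ascending=False)
    return clean_df.drop_duplicates(subset="staxids", keep="first")


# Made-up rows, for illustration only
toy = pd.DataFrame({
    "staxids": ["111", "111", "222;333", "444"],
    "sacc": ["X1", "X2", "X3", "X4"],
    "sseqlen": [900, 1500, 1200, 700],
})
cleaned = clean_dataframe(toy)
print(cleaned)
```

The row with taxids `222;333` is dropped, and of the two rows for taxid `111` only `X2` (the longer one) survives, leaving `X4` and `X2`. Because `staxids` is sorted as text, the order between different taxids is lexicographic rather than numeric; this does not affect which record is kept for each taxid.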

Now that we have the functions to extract and clean the data, we have to concatenate the blast results into a single, final dataframe:

In [63]:
blast_data = []
for blast_result in glob.glob("./blast_results/*.blast"):
    blast_data.append(clean_dataframe(extract_columns_dataframe(create_dataframe(blast_result))))
blast_alldata = pd.concat(blast_data)

  warn("%r does not look like a %s file"
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['sseq'] = df['sseq'].apply(lambda x: x.replace('-', ''))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['sseqlen'] = df.apply(lambda row: len(row.sseq), axis = 1)
  warn("%r does not look like a %s file"
  warn("%r does not look like a %s file"
  warn("%r does not look like a %s file"
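
The pandas messages in this output are `SettingWithCopyWarning`s: `extract_columns_dataframe` assigns new columns to the result of `df[[...]]`, which pandas may treat as a view of the original frame. One common way to avoid the warning is to take an explicit `.copy()` of the selection. The sketch below shows that variant under stated assumptions: the function name `extract_columns_copy` and the toy rows are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def extract_columns_copy(df):
    # .copy() makes the selection an independent DataFrame, so the
    # assignments below do not trigger SettingWithCopyWarning
    out = df[["qseqid", "staxids", "sscinames", "sacc", "sseq"]].copy()
    out["sseq"] = out["sseq"].str.replace("-", "", regex=False)  # drop alignment gaps
    out["sseqlen"] = out["sseq"].str.len()                       # ungapped length
    return out[["qseqid", "staxids", "sscinames", "sacc", "sseqlen", "sseq"]]


# Made-up rows, for illustration only
toy = pd.DataFrame({
    "qseqid": ["q1", "q1"],
    "staxids": ["111", "222"],
    "sscinames": ["Genus one", "Genus two"],
    "sacc": ["X1", "X2"],
    "sseq": ["ATG-CC", "A--T"],
    "extra": [1, 2],
})
result = extract_columns_copy(toy)
print(result[["sacc", "sseqlen", "sseq"]])
```

The vectorised `.str` accessors replace the row-wise `apply` calls but compute the same values, so the resulting columns match those of the original function.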


Despite the warnings, the resulting dataframe is correctly formatted and therefore suitable for downstream analyses:

In [65]:
blast_alldata

| | qseqid | staxids | sscinames | sacc | sseqlen | sseq |
|---|---|---|---|---|---|---|
| 164 | Formica_fusca-COX1 | 88063 | Messor bouvieri | DQ074325 | 1248 | GGATCATCTATAAGAATGATTATTCGACTTGAATTAGGATCATGTA... |
| 109 | Formica_fusca-COX1 | 84561 | Oecophylla smaragdina | AB185475 | 1042 | CCTTTAATATTAGGATCGCCTGATATAGCATATCCCCGTATAAATA... |
| 4 | Formica_fusca-COX1 | 84560 | Formica lemani | AB019425 | 974 | ATTCCCTTAATACTAGGATCTCCAGACATAGCTTATCCTCGTATAA... |
| 40 | Formica_fusca-COX1 | 84555 | Polyrhachis dives | KT266831 | 1530 | ATGAAAAAATGACTCTATTCAACTAACCATAAAGATATTGGAATGT... |
| 168 | Formica_fusca-COX1 | 81629 | Messor structor | KT184578 | 1367 | AGAATAATTATCCGACTTGAACTAGGGTCCTGTAACTCATTAATTA... |
| ... | ... | ... | ... | ... | ... | ... |
| 439 | Solenopsis_invicta-COX1 | 144042 | Pogonomyrmex rugosus | FJ824455 | 1371 | ATAATTATTCGACTTGAACTTGGTTCATGTAATAGCTTAATTAATA... |
| 0 | Solenopsis_invicta-COX1 | 13686 | Solenopsis invicta | HQ215538 | 1529 | ATGAATAAATGACTTTTTTCAACAAATCACAAAGACATTGGAATTT... |
| 4 | Solenopsis_invicta-COX1 | 121131 | Solenopsis geminata | HQ215537 | 1529 | ATGAACAAATGATTTTTTTCAACTAATCACAAAGATATTGGAATTT... |
| 327 | Solenopsis_invicta-COX1 | 1031672 | Messor minor x Messor cf. wasmanni BCSS-2011 | EU441274 | 1212 | TGTAATTCATTAATTAACAATGATCAAATTTATAATACTTTAGTGA... |


Lastly, let us see the percentage of hits shared between all sequences: