# Notebook 1: Homology matrix generation from genome sequences

In this notebook, I apply the notebooks accompanying the paper by Norsigian et al., 2020 (doi:10.1038/s41596-019-0254-3) to the P. thermo model we've been working on and to some related strains:
- G. thermoglucosidasius DSM2542
- G. thermoglucosidasius C56-YS93
- Geobacillus thermoglucosidans TNO-09.020
- Geobacillus sp. Y4.1MC1
- Geobacillus sp. WCH70
- Geobacillus LC300
- Geobacillus stearothermophilus DSM 458
- Geobacillus thermodenitrificans T12


This is the first notebook in the tutorial, and it creates a homology matrix from genome sequences. There are four major steps in this notebook:
1. Download the genome annotation (GenBank files) from NCBI, and generate FASTA files (protein & nucleotide) from them
2. Perform BLASTp to find homologous proteins in strains of interest
3. Use best bidirectional hits to create gene presence/absence matrix
4. Supplementary for best practice: use BLASTn to check whether any unannotated open reading frames were missed, retain these genes in the orthology matrix, and use them to guide future manual curation

In [6]:
#import packages needed
import pandas as pd
from glob import glob

In [7]:
from Bio import Entrez, SeqIO

In [8]:
import sys

In [9]:
import cobra

__NOTE__: to be able to import Entrez and SeqIO, I had to rename the package folder from 'bio' to 'Bio' in the following location, after which the import works:

C:\Users\vivmol\AppData\Local\Continuum\anaconda3\envs\g-thermo\Lib\site-packages

So be careful: whenever I install Biopython again, this needs to be fixed.

Here I will be working with strains in the facultative anaerobic clade of the genus. I will also add genomes of obligate aerobes, to see whether the comparison highlights what changed in these species to make them obligate aerobes.

In [5]:
# Load the information on the strains we will be working with in this tutorial
StrainsOfInterest=pd.read_excel('Strain Information.xlsx')
StrainsOfInterest

Unnamed: 0,Strain,NCBI ID,Oxygen requirement
0,Geobacillus stearothermophilus DSM 458,CP016552.1,Obligate aerobe
1,Geobacillus thermodenitrificans T12,CP020030.1,Obligate aerobe
2,G. thermoglucosidaius DSM2542,CP012712.1,Facultative anaerobe
3,G. thermoglucosidasius C56-YS93,CP002835.1,Facultative anaerobe
4,Geobacillus sp. WCH70,CP001638,Facultative anaerobe
5,Geobacillus LC300,CP008903.1,Obligate aerobe


In [6]:
#The Reference Genome is as Described in the Base Reconstruction; here the reference is P. thermoglucosidasius NCIMB 11955
referenceStrainID='CP016622.1'
targetStrainIDs=list(StrainsOfInterest['NCBI ID'])

## 1. Download genome annotations (GenBank files) to generate fasta files 

### Download genomes from NCBI
Download the genome annotations (GenBank files) from NCBI for strains of interest. 

In [7]:
# define a function to download the annotated GenBank files from NCBI
def dl_genome(id, folder='genomes'): # be sure to get the CORRECT ID
    import os
    out_file = '%s/%s.gb'%(folder, id)

    # os.path.exists accepts either path separator, unlike a string comparison against glob results on Windows
    if os.path.exists(out_file):
        print (out_file, 'already downloaded')
        return
    else:
        print ('downloading %s from NCBI'%id)
        
    from Bio import Entrez
    Entrez.email = "vivmol@biosustain.dtu.dk"     #Insert email here for NCBI
    handle = Entrez.efetch(db="nucleotide", id=id, rettype="gb", retmode="text")
    # the with-block closes the output file even if the write fails
    with open(out_file,'w') as fout:
        fout.write(handle.read())
    handle.close()

In [8]:
# execute the above function, and download the GenBank files for the strains of interest
for strain in targetStrainIDs:
    dl_genome(strain, folder='genomes')

downloading CP016552.1 from NCBI
downloading CP020030.1 from NCBI
downloading CP012712.1 from NCBI
downloading CP002835.1 from NCBI
downloading CP001638 from NCBI
downloading CP008903.1 from NCBI


In [9]:
#also download the reference strain info
dl_genome(referenceStrainID, folder='genomes')

downloading CP016622.1 from NCBI


### Examine the Downloaded Strains

In [10]:
# define a function to gather information of the downloaded strains from the GenBank files
def get_strain_info(folder='genomes'):
    import os
    files = glob('%s/*.gb'%folder)
    strain_info = []
    
    for file in files:
        # SeqIO.read expects exactly one record per GenBank file
        record = SeqIO.read(file, "genbank")
        
        for f in record.features:
            if f.type=='source':
                info = {}
                info['file'] = file
                # accession without the version suffix; basename handles both / and \ separators
                info['id'] = os.path.basename(file).split('.')[0]
                for q in f.qualifiers.keys():
                    info[q] = '|'.join(f.qualifiers[q])
                strain_info.append(info)
    return pd.DataFrame(strain_info)

In [11]:
# information on the downloaded strain
get_strain_info(folder='genomes')

Unnamed: 0,file,id,organism,mol_type,strain,db_xref,isolation_source,culture_collection,type_material,country,collection_date,lat_lon
0,genomes\CP001638.gb,CP001638,Geobacillus sp. WCH70,genomic DNA,WCH70,taxon:471223,,,,,,
1,genomes\CP002835.1.gb,CP002835,Parageobacillus thermoglucosidasius C56-YS93,genomic DNA,C56-YS93,taxon:634956,,,,,,
2,genomes\CP008903.1.gb,CP008903,Geobacillus sp. LC300,genomic DNA,LC300,taxon:1519377,,,,,,
3,genomes\CP012712.1.gb,CP012712,Parageobacillus thermoglucosidasius,genomic DNA,DSM 2542,taxon:1426,soil,DSM:2542,type strain of Parageobacillus thermoglucosida...,China: Beijing,01-Jan-2010,
4,genomes\CP016552.1.gb,CP016552,Geobacillus stearothermophilus,genomic DNA,DSM 458,taxon:1422,sugar beet juice from extraction installations,DSM:458,,Austria,,
5,genomes\CP016622.1.gb,CP016622,Parageobacillus thermoglucosidasius,genomic DNA,NCIMB 11955,taxon:1426,,NCIMB:11955,type strain of Parageobacillus thermoglucosida...,United Kingdom: Surrey,01-Jun-2014,
6,genomes\CP020030.1.gb,CP020030,Geobacillus thermodenitrificans,genomic DNA,T12,taxon:33940,soil,,,Netherlands:ede,22-Oct-2012,52.0439 N 5.6167 E


### Generate FASTA files for both Protein and Nucleotide Pipelines
From the GenBank file, we can extract sequence and annotation information to generate FASTA files for the protein and nucleotide analyses. The resulting FASTA files will then be used in step 2 as input for BLAST.
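
As an illustration of what ends up in those files, here is a minimal sketch of FASTA formatting and of the header fallback order (locus tag, then gene name, then a generated name). `fasta_record` and `pick_identifier` are hypothetical helpers, and the line wrapping is optional; the function defined in the next cell writes each sequence on a single line.

```python
def fasta_record(name, sequence, width=60):
    """Format one FASTA record, wrapping the sequence at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))

def pick_identifier(qualifiers, counter):
    """Choose a header: locus_tag first, then gene, then a generated fallback."""
    for key in ("locus_tag", "gene"):
        if key in qualifiers:
            return qualifiers[key][0]
    return "gene_%i" % counter

# Hypothetical qualifiers, shaped like Biopython's feature.qualifiers (dict of lists)
print(fasta_record(pick_identifier({"gene": ["pgi"]}, 0), "ATGAAA"))
```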

In [12]:
# define a function to parse the Genbank file to generate fasta files for both protein and nucleotide sequences
def parse_genome(id, type='prot', in_folder='genomes', out_folder='prots', overwrite=1):

    in_file = '%s/%s.gb'%(in_folder, id)
    out_file='%s/%s.fa'%(out_folder, id)
    files =glob('%s/*.fa'%out_folder)
    
    if out_file in files and overwrite==0:
        print (out_file, 'already parsed')
        return
    else:
        print ('parsing %s'%id)
    
    handle = open(in_file)
    
    fout = open(out_file,'w')
    x = 0
    
    records = SeqIO.parse(handle, "genbank")
    for record in records:
        for f in record.features:
            if f.type=='CDS':
                seq=f.extract(record.seq)
                
                if type=='nucl':
                    seq=str(seq)
                else:
                    seq=str(seq.translate())
                    
                if 'locus_tag' in f.qualifiers.keys():
                    locus = f.qualifiers['locus_tag'][0]
                elif 'gene' in f.qualifiers.keys():
                    locus = f.qualifiers['gene'][0]
                else:
                    locus = 'gene_%i'%x
                    x+=1
                fout.write('>%s\n%s\n'%(locus, seq))
    fout.close()

In [13]:
# Generate fasta files for the strains of interest
for strain in targetStrainIDs:
    parse_genome(strain, type='prot', in_folder='genomes', out_folder='prots')
    parse_genome(strain, type='nucl', in_folder='genomes', out_folder='nucl')


parsing CP016552.1
parsing CP016552.1
parsing CP020030.1
parsing CP020030.1
parsing CP012712.1
parsing CP012712.1
parsing CP002835.1
parsing CP002835.1
parsing CP001638
parsing CP001638
parsing CP008903.1
parsing CP008903.1


In [14]:
#Also generate fasta files for the reference strain
parse_genome(referenceStrainID, type='nucl', in_folder='genomes', out_folder='nucl')
parse_genome(referenceStrainID, type='prot', in_folder='genomes', out_folder='prots')

parsing CP016622.1
parsing CP016622.1




## 2. Perform BLAST to find homologous proteins in strains of interest

### Make BLAST DB for each of the target strains for both Protein and Nucleotide Pipelines

In this tutorial, we will run both BLASTp for proteins and BLASTn for nucleotides. BLASTp is the main approach for identifying homologous proteins between the reference strain and the other strains of interest, while BLASTn is a supplementary method to check for any unannotated genes.
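
A minimal sketch of how the database-building command could be assembled as an argument list and checked before running. The helper names are hypothetical; only the `-in` and `-dbtype` flags already used in this notebook appear. Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.

```python
import shutil

def makeblastdb_command(fasta_path, db_type="prot"):
    """Build the makeblastdb argument list without running it."""
    if db_type not in ("prot", "nucl"):
        raise ValueError("db_type must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", db_type]

def blast_is_installed():
    """True when the makeblastdb executable can be found on the PATH."""
    return shutil.which("makeblastdb") is not None

cmd = makeblastdb_command("prots/CP016622.1.fa", "prot")
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True), which raises if makeblastdb fails
```

Calling `blast_is_installed()` first gives an early, explicit failure when the BLAST+ binaries are not on the PATH, rather than a silent non-zero exit status.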

In [15]:
# Define a function to make a BLAST database for either protein or nucleotide sequences
def make_blast_db(id,folder='prots',db_type='prot'):
    import os
    
    # makeblastdb writes a .pin index for protein databases and a .nin index for nucleotide ones
    if db_type=='nucl':
        ext, index_ext = 'fna', 'nin'
    else:
        ext, index_ext = 'fa', 'pin'

    # skip the work when the index file for this database already exists
    out_file ='%s/%s.%s.%s'%(folder, id, ext, index_ext)
    if os.path.exists(out_file):
        print (id, 'already has a blast db')
        return

    cmd_line='makeblastdb -in %s/%s.%s -dbtype %s' %(folder, id, ext, db_type)
    
    print ('making blast db with following command line...')
    print (cmd_line)
    os.system(cmd_line)

In [5]:
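# sys.path only affects Python imports; os.system finds executables through the PATH environment variable
import os
os.environ['PATH'] += os.pathsep + os.path.abspath('..\\..\\..\\..\\..\\..\\Program Files\\NCBI\\blast-2.10.1+\\bin')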

In [17]:
# make protein sequence databases 
# Because we are performing bi-directional blast, we make databases from both reference strain and strains of interest
for strain in targetStrainIDs:
    make_blast_db(strain,folder='prots',db_type='prot')
make_blast_db(referenceStrainID,folder='prots',db_type='prot')

making blast db with following command line...
makeblastdb -in prots/CP016552.1.fa -dbtype prot
making blast db with following command line...
makeblastdb -in prots/CP020030.1.fa -dbtype prot
making blast db with following command line...
makeblastdb -in prots/CP012712.1.fa -dbtype prot
making blast db with following command line...
makeblastdb -in prots/CP002835.1.fa -dbtype prot
making blast db with following command line...
makeblastdb -in prots/CP001638.fa -dbtype prot
making blast db with following command line...
makeblastdb -in prots/CP008903.1.fa -dbtype prot
making blast db with following command line...
makeblastdb -in prots/CP016622.1.fa -dbtype prot


### Define functions to run protein BLAST and get sequence lengths
- BLASTp will be the main approach used here to identify homologous proteins between strains 
- Aside from sequence similarity, we also want to ensure that the coverage of the alignment is sufficient. Therefore, we need to identify the sequence length of each protein and compare it with the alignment length.
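
To make the coverage idea concrete, here is a small worked sketch in plain Python. The hit values are made up for illustration, and `alignment_coverage` is a hypothetical helper; the 0.25 cutoff is the one applied in `get_bbh` later in this notebook. Note that BLAST's alignment length counts gap columns, so coverage computed this way can slightly exceed 1.

```python
def alignment_coverage(aln_length, gene_length):
    """Fraction of the query sequence spanned by the alignment."""
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    return aln_length / gene_length

# Hypothetical hits: (query gene, alignment length, query protein length)
hits = [("geneA", 290, 300), ("geneB", 40, 400)]
min_cov = 0.25
kept = [g for g, aln, length in hits if alignment_coverage(aln, length) >= min_cov]
print(kept)  # geneB covers only 10% of its query and is dropped
```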

In [18]:
# define a function to run BLASTp
def run_blastp(seq,db,in_folder='prots', out_folder='bbh', out=None,outfmt=6,evalue=0.001,threads=1):
    import os
    if out is None:
        out='%s/%s_vs_%s.txt'%(out_folder, seq, db)
        print(out)
    
    # os.path.exists avoids the path-separator mismatch with glob results on Windows
    if os.path.exists(out):
        print (seq, 'already blasted')
        return
    
    print ('blasting %s vs %s'%(seq, db))
    
    db = '%s/%s.fa'%(in_folder, db)
    seq = '%s/%s.fa'%(in_folder, seq)
    cmd_line='blastp -db %s -query %s -out %s -evalue %s -outfmt %s -num_threads %i' \
    %(db, seq, out, evalue, outfmt, threads)
    
    print ('running blastp with following command line...')
    print (cmd_line)
    os.system(cmd_line)
    return out

In [19]:
# define a function to get sequence length 

def get_gene_lens(query, in_folder='prots'):

    file = '%s/%s.fa'%(in_folder, query)
    handle = open(file)
    records = SeqIO.parse(handle, "fasta")
    out = []
    
    for record in records:
        out.append({'gene':record.name, 'gene_length':len(record.seq)})
    
    out = pd.DataFrame(out)
    return out

## 3. Use Bi-Directional BLASTp Best Hits to create gene presence/absence matrix

### Obtain Bi-Directional BLASTp Best Hits

From the above BLASTp results, we can obtain Bi-Directional BLASTp Best Hits to identify homologous proteins. Note that besides the similarity score, the alignment coverage is also used to filter the mapping results.
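
The logic of a bidirectional best hit can be sketched without pandas. This is an illustrative simplification under stated assumptions, not the function used below: hits are plain `(query, subject, score)` tuples, the gene names are made up, and `score` stands for whatever ranking value is chosen (this notebook ranks by PID).

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def bidirectional_best_hits(forward, reverse):
    """Keep query->subject pairs whose subject also points back to the query."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}

# Hypothetical reference-vs-target and target-vs-reference hits
forward = [("r1", "t1", 95.0), ("r1", "t2", 60.0), ("r2", "t2", 88.0)]
reverse = [("t1", "r1", 95.0), ("t2", "r1", 61.0), ("t2", "r2", 59.0)]
print(bidirectional_best_hits(forward, reverse))
```

Here `r2` picks `t2`, but `t2` prefers `r1`, so only the `r1`/`t1` pair counts as bidirectional.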

In [20]:
# define a function to get Bi-Directional BLASTp Best Hits
def get_bbh(query, subject, in_folder='bbh'):    
    
    #Utilize the defined protein BLAST function
    run_blastp(query, subject)
    run_blastp(subject, query)
    
    query_lengths = get_gene_lens(query, in_folder='prots')
    subject_lengths = get_gene_lens(subject, in_folder='prots')
    
    #Define the output file of this BLAST
    out_file = '%s/%s_vs_%s_parsed.csv'%(in_folder,query, subject)
    files=glob('%s/*_parsed.csv'%in_folder)
    
    #Combine the results of the protein BLAST into a dataframe
    print ('parsing BBHs for', query, subject)
    cols = ['gene', 'subject', 'PID', 'alnLength', 'mismatchCount', 'gapOpenCount', 'queryStart', 'queryEnd', 'subjectStart', 'subjectEnd', 'eVal', 'bitScore']
    bbh=pd.read_csv('%s/%s_vs_%s.txt'%(in_folder,query, subject), sep='\t', names=cols)
    bbh = pd.merge(bbh, query_lengths) 
    bbh['COV'] = bbh['alnLength']/bbh['gene_length']
    
    bbh2=pd.read_csv('%s/%s_vs_%s.txt'%(in_folder,subject, query), sep='\t', names=cols)
    bbh2 = pd.merge(bbh2, subject_lengths) 
    bbh2['COV'] = bbh2['alnLength']/bbh2['gene_length']
    out = pd.DataFrame()
    
    # Filter the genes based on coverage
    bbh = bbh[bbh.COV>=0.25]
    bbh2 = bbh2[bbh2.COV>=0.25]
    
    #Delineate the best hits from the BLAST
    for g in bbh.gene.unique():
        res = bbh[bbh.gene==g]
        if len(res)==0:
            continue
        best_hit = res.loc[res.PID.idxmax()]
        best_gene = best_hit.subject
        res2 = bbh2[bbh2.gene==best_gene]
        if len(res2)==0:
            continue
        best_hit2 = res2.loc[res2.PID.idxmax()]
        best_gene2 = best_hit2.subject
        if g==best_gene2:
            best_hit['BBH'] = '<=>'
        else:
            best_hit['BBH'] = '->'
        out=pd.concat([out, pd.DataFrame(best_hit).transpose()])
    
    #Save the final file to a designated CSV file    
    out.to_csv(out_file)

In [21]:
# Execute the BLAST for each target strain against the reference strain and save the results to the 'bbh'
# (bidirectional best hits) folder, which is later used to create the homology matrix

for strain in targetStrainIDs:
    get_bbh(referenceStrainID,strain, in_folder='bbh')

bbh/CP016622.1_vs_CP016552.1.txt
blasting CP016622.1 vs CP016552.1
running blastp with following command line...
blastp -db prots/CP016552.1.fa -query prots/CP016622.1.fa -out bbh/CP016622.1_vs_CP016552.1.txt -evalue 0.001 -outfmt 6 -num_threads 1
bbh/CP016552.1_vs_CP016622.1.txt
blasting CP016552.1 vs CP016622.1
running blastp with following command line...
blastp -db prots/CP016622.1.fa -query prots/CP016552.1.fa -out bbh/CP016552.1_vs_CP016622.1.txt -evalue 0.001 -outfmt 6 -num_threads 1
parsing BBHs for CP016622.1 CP016552.1
bbh/CP016622.1_vs_CP020030.1.txt
blasting CP016622.1 vs CP020030.1
running blastp with following command line...
blastp -db prots/CP020030.1.fa -query prots/CP016622.1.fa -out bbh/CP016622.1_vs_CP020030.1.txt -evalue 0.001 -outfmt 6 -num_threads 1
bbh/CP020030.1_vs_CP016622.1.txt
blasting CP020030.1 vs CP016622.1
running blastp with following command line...
blastp -db prots/CP016622.1.fa -query prots/CP020030.1.fa -out bbh/CP020030.1_vs_CP016622.1.txt -evalue 

### Parse the BLAST Results into one Homology Matrix of the Reconstruction Genes

For the homology matrix, we only focus on genes that are present in the reference model

In [22]:
#Load all the BLAST files between the reference strain and target strains

blast_files=glob('%s/*_parsed.csv'%'bbh')

for blast in blast_files:
    bbh=pd.read_csv(blast)
    print (blast,bbh.shape) 

bbh\CP016622.1_vs_CP001638_parsed.csv (0, 1)
bbh\CP016622.1_vs_CP002835.1_parsed.csv (0, 1)
bbh\CP016622.1_vs_CP008903.1_parsed.csv (0, 1)
bbh\CP016622.1_vs_CP012712.1_parsed.csv (0, 1)
bbh\CP016622.1_vs_CP016552.1_parsed.csv (0, 1)
bbh\CP016622.1_vs_CP020030.1_parsed.csv (0, 1)


In [10]:
#Load the base reconstruction to designate the list of genes within the model
model = cobra.io.read_sbml_model('../../../model/g-thermo.xml')

In [11]:
listGeneIDs=[]
for gene in model.genes:
    listGeneIDs.append(gene.id)

In [21]:
#Create 2 matrices of N rows, where N is the number of model genes, and M columns, where M is the number of target strains
#One matrix will be populated with the PID results from the blasts and the other with the mapping of gene locus tags

ortho_matrix=pd.DataFrame(index=listGeneIDs,columns=targetStrainIDs)
geneIDs_matrix=pd.DataFrame(index=listGeneIDs,columns=targetStrainIDs)

In [22]:
#Parse through each blast file and acquire pertinent information for each matrix for each of the base reconstruction genes
for blast in blast_files:
    bbh=pd.read_csv(blast)
    listIDs=[]
    listPID=[]
    for r,row in ortho_matrix.iterrows():
        try:
            currentOrtholog=bbh[bbh['gene']==r].reset_index()
            listIDs.append(currentOrtholog.iloc[0]['subject'])
            listPID.append(currentOrtholog.iloc[0]['PID'])
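        # IndexError: no hit for this gene; KeyError: the parsed file is empty and has no 'gene' column
        except (IndexError, KeyError):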
            listIDs.append('None')
            listPID.append(0)
    for col in ortho_matrix.columns:
        if col in blast:
            ortho_matrix[col]=listPID
            geneIDs_matrix[col]=listIDs

### Apply Similarity Threshold to Binarize  Homology Matrix to Presence/Absence Matrix
In this step, choose a threshold for the PID to determine whether a gene is present or absent in the strain of interest. We can then convert the homology matrix generated above into a binarized presence/absence matrix.
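
As a small worked sketch of the binarization rule (strictly greater than the threshold means present), using made-up PID values and a hypothetical `binarize` helper on a plain dictionary rather than the DataFrame:

```python
def binarize(pid_by_gene, threshold=80.0):
    """Return 1 for genes whose PID exceeds the threshold, else 0."""
    return {gene: int(pid > threshold) for gene, pid in pid_by_gene.items()}

# Hypothetical PIDs; 0 stands for "no hit found"
pids = {"geneA": 97.5, "geneB": 80.0, "geneC": 0}
print(binarize(pids))
```

A gene sitting exactly at the threshold (`geneB`) is counted as absent, which matches the `<=80.0` comparison in the cell below.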

In [23]:
# In this tutorial, genes with a PID greater than 80% are considered present in the target strain genome 
# and genes with a PID of 80% or less are considered absent from the target strain genome
for column in ortho_matrix:
    ortho_matrix.loc[ortho_matrix[column]<=80.0,column]=0
    ortho_matrix.loc[ortho_matrix[column]>80.0,column]=1

In [24]:
ortho_matrix

Unnamed: 0,CP000946.1,CU651637.1,CP002167.1,CU928163.2,CU928164.2
b2551,1.0,1.0,1.0,1.0,1.0
b0870,1.0,1.0,1.0,1.0,1.0
b3368,1.0,1.0,1.0,1.0,1.0
b2436,1.0,1.0,1.0,1.0,1.0
b3500,1.0,1.0,1.0,1.0,1.0
b0945,1.0,1.0,1.0,1.0,1.0
b4467,1.0,1.0,1.0,1.0,1.0
b4468,1.0,1.0,1.0,1.0,1.0
b2979,1.0,1.0,1.0,1.0,1.0
b3916,1.0,1.0,1.0,1.0,1.0


## 4. Perform BLASTn to check unannotated open reading frames to guide manual curation 
At this juncture it may be useful to run a supplementary nucleotide BLAST to check for unannotated genes; the results become candidates for manual curation. In this tutorial we retain unannotated genes that pass the similarity threshold and contain no premature stop codons.
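
A nucleotide string never contains a `*` character, so a premature-stop check has to look at codons (or at a translated protein). The sketch below is one way to do that in plain Python. It assumes the standard stop codons and that the reading frame is known; `has_premature_stop` and its `frame_offset` parameter are hypothetical and not part of the original tutorial.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_premature_stop(nucl_seq, frame_offset=0):
    """True if an in-frame stop codon occurs before the final codon."""
    seq = nucl_seq.upper()[frame_offset:]
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # the last codon is allowed to be the normal terminating stop
    return any(codon in STOP_CODONS for codon in codons[:-1])

print(has_premature_stop("ATGAAATAA"))     # stop only at the end
print(has_premature_stop("ATGTAAAAATAA"))  # stop in the second codon
```

For a BLASTn hit that does not start at the first base of the query gene, the frame offset would need to be derived from the query start position, and alignment gaps can shift the frame further, so results from such a check are best treated as hints for manual curation.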

In [43]:
#Define a function to generate FNA from the GBK files
def gbk2fasta(gbk_filename):
    faa_filename = '.'.join(gbk_filename.split('.')[:-1])+'.fna'
    input_handle  = open(gbk_filename, "r")
    output_handle = open(faa_filename, "w")

    for seq_record in SeqIO.parse(input_handle, "genbank") :
        print ("Converting GenBank record %s" % seq_record.id)
        output_handle.write(">%s %s\n%s\n" % (
               seq_record.id,
               seq_record.description,
               seq_record.seq))

    output_handle.close()
    input_handle.close()

In [44]:
#Define function to run the BLASTn
def run_blastn(seq, db,outfmt=6,evalue=0.001,threads=1):
    import os
    out = 'nucl/'+seq+'_vs_'+db+'.txt'
    seq = 'nucl/'+seq+'.fa'
    db = 'genomes/'+db+'.fna'
    
    cmd_line='blastn -db %s -query %s -out %s -evalue %s -outfmt %s -num_threads %i' \
    %(db, seq, out, evalue, outfmt, threads)
    
    print ('running blastn with following command line...')
    print (cmd_line)
    os.system(cmd_line)
    return out

In [42]:
# make nucleotide sequence databases (the .fna files written by gbk2fasta must already exist in 'genomes')
for strain in targetStrainIDs:
    make_blast_db(strain,folder='genomes',db_type='nucl')

making blast db with following command line...
makeblastdb -in genomes/CP000946.1.fna -dbtype nucl
making blast db with following command line...
makeblastdb -in genomes/CU651637.1.fna -dbtype nucl
making blast db with following command line...
makeblastdb -in genomes/CP002167.1.fna -dbtype nucl
making blast db with following command line...
makeblastdb -in genomes/CU928163.2.fna -dbtype nucl
making blast db with following command line...
makeblastdb -in genomes/CU928164.2.fna -dbtype nucl


In [45]:
# convert genbank files to fna files for strains of interest
for strain in targetStrainIDs:
    gbk2fasta('genomes/'+strain+'.gb')

Converting GenBank record CP000946.1
Converting GenBank record CU651637.1
Converting GenBank record CP002167.1
Converting GenBank record CU928163.2
Converting GenBank record CU928164.2


In [46]:
# perform uni-directional BLASTn hit
genome_blast_res=[]
for strain in targetStrainIDs:
    res = run_blastn(referenceStrainID,strain)
    genome_blast_res.append(res)

running blastn with following command line...
blastn -db genomes/CP000946.1.fna -query nucl/NC_000913.3.fa -out nucl/NC_000913.3_vs_CP000946.1.txt -evalue 0.001 -outfmt 6 -num_threads 1
running blastn with following command line...
blastn -db genomes/CU651637.1.fna -query nucl/NC_000913.3.fa -out nucl/NC_000913.3_vs_CU651637.1.txt -evalue 0.001 -outfmt 6 -num_threads 1
running blastn with following command line...
blastn -db genomes/CP002167.1.fna -query nucl/NC_000913.3.fa -out nucl/NC_000913.3_vs_CP002167.1.txt -evalue 0.001 -outfmt 6 -num_threads 1
running blastn with following command line...
blastn -db genomes/CU928163.2.fna -query nucl/NC_000913.3.fa -out nucl/NC_000913.3_vs_CU928163.2.txt -evalue 0.001 -outfmt 6 -num_threads 1
running blastn with following command line...
blastn -db genomes/CU928164.2.fna -query nucl/NC_000913.3.fa -out nucl/NC_000913.3_vs_CU928164.2.txt -evalue 0.001 -outfmt 6 -num_threads 1


In [47]:
#define a function to parse through the nucleotide BLAST results and form one matrix of all the results
def parse_nucl_blast(infile):
    cols = ['gene', 'subject', 'PID', 'alnLength', 'mismatchCount', 'gapOpenCount', 'queryStart', 'queryEnd', 'subjectStart', 'subjectEnd', 'eVal', 'bitScore']
    data = pd.read_csv(infile, sep='\t', names=cols)
    data = data[(data['PID']>80) & (data['alnLength']>0.8*data['queryEnd'])]
    data2=data.groupby('gene').first()
    return data2.reset_index()


In [48]:
# parse the nucleotide blast matrix 
na_frames=[]
for file in genome_blast_res:
    genes =parse_nucl_blast(file)
    na_frames.append(genes[['gene','subject','PID']])
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate them once
na_matrix = pd.concat(na_frames)
na_matrix = pd.pivot_table(na_matrix, index='gene', columns='subject',values='PID')

In [49]:
na_matrix.head()

subject,CP000946.1,CP002167.1,CU651637.1,CU928163.2,CU928164.2
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
b0002,97.97,97.69,97.56,98.78,98.86
b0003,98.71,98.29,98.29,98.29,99.68
b0004,98.99,98.06,97.75,98.21,97.82
b0005,99.66,91.09,97.03,97.98,98.73
b0006,99.23,97.68,97.55,98.46,98.58


### Examine unannotated open reading frames
We compare the results from BLASTp and BLASTn and record any inconsistencies between the two matrices due to missing annotation. This result is then saved to guide future manual curation. 
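
Pulling the hit sequence out of a genome requires converting BLAST's subject coordinates, which are 1-based and inclusive, into Python slices, and reverse-complementing hits on the minus strand (reported with start greater than end). A minimal sketch in plain Python, with hypothetical helper names and a toy contig:

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_hit(contig_seq, s_start, s_end):
    """Return the hit for 1-based inclusive BLAST subject coordinates."""
    if s_start <= s_end:
        return contig_seq[s_start - 1:s_end]
    # minus-strand hit: take the same span, then flip it
    return reverse_complement(contig_seq[s_end - 1:s_start])

contig = "ATGCCCTAA"
print(extract_hit(contig, 1, 3))  # plus strand, first three bases
print(extract_hit(contig, 3, 1))  # same span read from the minus strand
```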

In [50]:
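# define a function to extract the sequence from fna file 
# BLAST subject coordinates are 1-based and inclusive; start > end marks a hit on the reverse strand
def extract_seq(g, contig, start, end):
    from Bio import SeqIO
    seq = None
    with open(g) as handle:
        for record in SeqIO.parse(handle, "fasta"):
            if record.name==contig:
                if end>start:
                    section = record[start-1:end]
                else:
                    section = record[end-1:start].reverse_complement()
                    
                seq = str(section.seq)
    # fail loudly instead of returning an undefined value when the contig is missing
    if seq is None:
        raise ValueError('contig %s not found in %s'%(contig, g))
    return seq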

In [51]:
#Define updated matrices that will include genes based on sequence evidence that were missing due to lack of annotation
ortho_matrix_w_unannotated = ortho_matrix.copy()
geneIDs_matrix_w_unannotated = geneIDs_matrix.copy()

In [52]:
#Define matrix of the BLASTn results for all the pertinent model genes
nonModelGenes=[]
for g in na_matrix.index:
    if g not in listGeneIDs:
        nonModelGenes.append(g)

na_model_genes=na_matrix.drop(nonModelGenes)

In [53]:
#For each strain in the ortho_matrix, identify genes that meet threshold of SEQ similarity, but missing from
#annotated ORFS. Additionally, look at the sequence to ensure that these cases do not have early stop codons indicating
#nonfunctional even if the NA seqs meet the threshold

pseudogenes = {}

for c in ortho_matrix.columns:
    
    orfs = ortho_matrix[c]
    genes = na_model_genes[c]
    # All the Model Genes that met the BLASTp Requirements
    orfs2 = orfs[orfs==1].index.tolist()
    # All the Model Genes based off of BLASTn similarity above threshold of 80
    genes2 = genes[genes>=80].index.tolist()
    # By Definition find the genes that pass sequence threshold but were NOT in annotated ORFs:
    unannotated = set(genes2) -set(orfs2)
    
    # Obtain sequences of this list to check for premature stop codons:
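    # build the file name from the reference strain used in run_blastn rather than a hard-coded accession
    data = 'nucl/%s_vs_%s.txt'%(referenceStrainID, c)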
    cols = ['gene', 'subject', 'PID', 'alnLength', 'mismatchCount', 'gapOpenCount', 'queryStart', 'queryEnd', 'subjectStart', 'subjectEnd', 'eVal', 'bitScore']
    data = pd.read_csv(data, sep='\t', names=cols)
    #
    pseudogenes[c] = {}
    unannotated_data = data[data['gene'].isin(list(unannotated))]
    for i in unannotated_data.index:
        gene = data.loc[i,'gene']
        contig = data.loc[i,'subject'] 
        start = data.loc[i,'subjectStart']
        end = data.loc[i,'subjectEnd']
        seq = extract_seq('genomes/%s.fna'%c,contig, start, end)
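        # check for early stop codons - these are likely nonfunctional and shouldn't be included
        # NOTE: seq is a nucleotide string, so '*' can only appear once it has been translated in frame
        if '*' in seq:
            print (seq)
            pseudogenes[c][gene]=seq
            # Remove the gene from the set of unannotated genes (in place)
            unannotated.discard(gene)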
            
    
    print (c, unannotated)
    
    # For pertinent genes, retain those based off of nucleotide similarity within the orthology matrix and geneIDs matrix
    # .loc needs a list here; recent pandas versions reject a set as an indexer
    ortho_matrix_w_unannotated.loc[list(unannotated),c]=1
    for g in unannotated:
        geneIDs_matrix_w_unannotated.loc[g,c] = '%s_ortholog'%g
    

CP000946.1 {'b4321', 'b0973', 'b0516', 'b3577', 'b1621', 'b1817', 'b1616', 'b2483', 'b0030', 'b4513', 'b1771'}
CU651637.1 {'b4321', 'b2930', 'b2344', 'b2690', 'b2430', 'b4513', 'b1588', 'b0150'}
CP002167.1 {'b4321', 'b2930', 'b4086', 'b1587', 'b0936', 'b2519', 'b2430', 'b3715', 'b4513', 'b3579', 'b1897'}
CU928163.2 {'b4513', 'b4515'}
CU928164.2 {'b4513', 'b4515'}


In [54]:
#Save the Presence/Absence Matrix and geneIDs Matrix for future use
ortho_matrix_w_unannotated.to_csv('ortho_matrix.csv')
geneIDs_matrix_w_unannotated.to_csv('geneIDs_matrix.csv')