**Author:** Dan Shea  
**Date:** 2019.11.15  
**Description:**  
#### Best Reciprocal `blastp` Hit
Given amino acid sequences for four _Brassicaceae_ organisms (five input files in total, because _B. rapa_ is present as both version 1.5 and version 3), we will run pairwise `blastp` and report the best hit for each of the 20 possible pairings (5 × 4 = 20 rather than 10, because the query/subject order matters). The results will then be merged for pairings that show reciprocity. Genes whose `blastp` results do not show reciprocity will be saved in another file. We will make use of `Bio.Blast.Applications`, which can run command-line `blastp` from within a Python program using `NcbiblastxCommandline`.

__UPDATE__ - It turns out that multithreading is not being used when `blastp` is invoked via `NcbiblastxCommandline`. So I have a script called `run_blast.sh` (what else! ;-). This script fires off `blastp` sequentially and asks for 10 threads. We will import the XML output in this workbook to determine the reciprocal best hits.
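
The contents of `run_blast.sh` are not shown in this workbook. As a rough sketch only, the sequential runs it describes could be assembled in Python as below. The helper name `build_blastp_command`, the output file naming, and the choice of flags are assumptions made to match the commented-out `run_blastp` cell and the `.xml` file names used later; they are not taken from the actual script.

```python
import os.path

def build_blastp_command(queryfile, subjectfile, threads=10):
    """Return the argument list for one blastp run (hypothetical helper)."""
    query = os.path.basename(queryfile).split('.faa')[0]
    subject = os.path.basename(subjectfile).split('.faa')[0]
    # Assumed naming convention: <query>_<subject>.xml
    outfile = f'{query}_{subject}.xml'
    return ['blastp',
            '-query', queryfile,
            '-subject', subjectfile,
            '-out', outfile,
            '-outfmt', '5',            # XML output
            '-max_target_seqs', '1',   # keep only the top target
            '-num_threads', str(threads)]

cmd = build_blastp_command('references/A_thaliana.faa', 'references/B_napus.faa')
print(' '.join(cmd))
```

Each list could then be handed to `subprocess.run(cmd, check=True)`, one pairing at a time.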

In [1]:
from Bio.Blast import NCBIXML
import os
import os.path
import itertools
import operator

In [2]:
# See UPDATE
# def run_blastp(queryfile, subjectfile):
#     query = os.path.basename(queryfile).split('.faa')[0]
#     subject = os.path.basename(subjectfile).split('.faa')[0]
#     outfile = '_vs_'.join([query, subject]) + '.xml'
#     # Construct the command line to run such that we only return the best hit (max_target_seqs=1)
#     blastp_cline = NcbiblastxCommandline(cmd='blastp', out=outfile, outfmt=5, query=queryfile, subject=subjectfile, max_target_seqs=1, num_threads=10)
#     # invoke the blastp command
#     blastp_cline()

In [3]:
inputfiles = ['A_thaliana.faa', 'B_napus.faa', 'B_oleracea.faa', 'B_rapa.faa', 'B_rapa_v1.5.faa']
inputfiles = list(itertools.starmap(operator.add, itertools.product(['references/'],inputfiles)))
idx = 0
combinations = list()
for ifile in inputfiles:
    combinations.extend(list(itertools.product([ifile], inputfiles[0:idx]+inputfiles[idx+1:])))
    idx += 1
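
For comparison, the same 20 ordered pairs can be produced with `itertools.permutations`, which yields every ordered pair of distinct items in the same order as the loop above. This is only an alternative sketch; the variable names are kept from the cell above so the rest of the workbook would not need to change.

```python
import itertools

inputfiles = ['A_thaliana.faa', 'B_napus.faa', 'B_oleracea.faa', 'B_rapa.faa', 'B_rapa_v1.5.faa']
inputfiles = ['references/' + name for name in inputfiles]

# Every ordered (query, subject) pair of distinct files: 5 * 4 = 20 pairs
combinations = list(itertools.permutations(inputfiles, 2))
print(len(combinations))
```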

In [4]:
# See UPDATE
# for combination in combinations:
#     run_blastp(*combination)

In [5]:
combinations

[('references/A_thaliana.faa', 'references/B_napus.faa'),
 ('references/A_thaliana.faa', 'references/B_oleracea.faa'),
 ('references/A_thaliana.faa', 'references/B_rapa.faa'),
 ('references/A_thaliana.faa', 'references/B_rapa_v1.5.faa'),
 ('references/B_napus.faa', 'references/A_thaliana.faa'),
 ('references/B_napus.faa', 'references/B_oleracea.faa'),
 ('references/B_napus.faa', 'references/B_rapa.faa'),
 ('references/B_napus.faa', 'references/B_rapa_v1.5.faa'),
 ('references/B_oleracea.faa', 'references/A_thaliana.faa'),
 ('references/B_oleracea.faa', 'references/B_napus.faa'),
 ('references/B_oleracea.faa', 'references/B_rapa.faa'),
 ('references/B_oleracea.faa', 'references/B_rapa_v1.5.faa'),
 ('references/B_rapa.faa', 'references/A_thaliana.faa'),
 ('references/B_rapa.faa', 'references/B_napus.faa'),
 ('references/B_rapa.faa', 'references/B_oleracea.faa'),
 ('references/B_rapa.faa', 'references/B_rapa_v1.5.faa'),
 ('references/B_rapa_v1.5.faa', 'references/A_thaliana.faa'),
 ('refe

#### Alrighty then! From here, we will parse the XML output generated by `blastp` and construct our BRBHs for each pairing
`blastp` produced `.xml` files as output (`-outfmt 5`). Now we will use the BLAST output parser in `Bio.Blast.NCBIXML`.

In [6]:
blastp_files = ['A_thaliana_B_napus.xml','A_thaliana_B_oleracea.xml','A_thaliana_B_rapa.xml','A_thaliana_B_rapa_v1.5.xml',
                'B_napus_A_thaliana.xml','B_napus_B_oleracea.xml','B_napus_B_rapa.xml','B_napus_B_rapa_v1.5.xml',
                'B_oleracea_A_thaliana.xml','B_oleracea_B_napus.xml','B_oleracea_B_rapa.xml','B_oleracea_B_rapa_v1.5.xml',
                'B_rapa_A_thaliana.xml','B_rapa_B_napus.xml','B_rapa_B_oleracea.xml','B_rapa_B_rapa_v1.5.xml',
                'B_rapa_v1.5_A_thaliana.xml','B_rapa_v1.5_B_napus.xml','B_rapa_v1.5_B_oleracea.xml','B_rapa_v1.5_B_rapa.xml',]

In [7]:
def parse_blast(filename):
    results = list()
    with open(filename, 'r') as fh:
        blast_records = NCBIXML.parse(fh)
        for blast_record in blast_records:
            # Only use the gene names to ensure the later reciprocal lookup will work
            # Extra spaces in the B. oleracea output were causing the lookup to fail
            query = blast_record.query.split(' ')[0]
            if len(blast_record.alignments) > 0:
                hit = blast_record.alignments[0].hit_def.split(' ')[0]
                results.append([query, hit])
            else:
                results.append([query, 'NA'])
    return results
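
On the whitespace issue mentioned in the comments: `str.split(' ')` splits on every single space, so a definition line that starts with a space yields an empty first token, whereas `str.split()` with no argument ignores leading and repeated whitespace. Whether this matters depends on the actual FASTA headers, which are not shown here. The illustration below uses a made-up gene ID and description, and `first_token` is a hypothetical helper, not part of the workbook.

```python
def first_token(defline):
    """Return the first whitespace-delimited token, or '' if there is none."""
    tokens = defline.split()
    return tokens[0] if tokens else ''

example = ' Bo1g001000  hypothetical protein'  # made-up definition line
print(repr(example.split(' ')[0]))   # first token when splitting on single spaces
print(repr(first_token(example)))    # first token when splitting on any whitespace
```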

In [8]:
# Let's construct keys for the results dictionary
blast_result_keys = list()
for a, b in combinations:
    a = os.path.basename(a).split('.faa')[0]
    b = os.path.basename(b).split('.faa')[0]
    blast_result_keys.append(':'.join([a,b]))

In [9]:
blast_result_keys

['A_thaliana:B_napus',
 'A_thaliana:B_oleracea',
 'A_thaliana:B_rapa',
 'A_thaliana:B_rapa_v1.5',
 'B_napus:A_thaliana',
 'B_napus:B_oleracea',
 'B_napus:B_rapa',
 'B_napus:B_rapa_v1.5',
 'B_oleracea:A_thaliana',
 'B_oleracea:B_napus',
 'B_oleracea:B_rapa',
 'B_oleracea:B_rapa_v1.5',
 'B_rapa:A_thaliana',
 'B_rapa:B_napus',
 'B_rapa:B_oleracea',
 'B_rapa:B_rapa_v1.5',
 'B_rapa_v1.5:A_thaliana',
 'B_rapa_v1.5:B_napus',
 'B_rapa_v1.5:B_oleracea',
 'B_rapa_v1.5:B_rapa']

In [10]:
# Now parse the results and store them in the dictionary
blast_result_dict = dict()
for key in blast_result_keys:
    blastp_file = '_'.join(key.split(':')) + '.xml'
    blast_result_dict[key] = parse_blast(blastp_file)

In [11]:
reciprocal_hits = dict()
checked = [0 for _ in range(len(blast_result_keys))]
A_idx = 0
for key in blast_result_keys:
    # A is the current key
    A = key
    # B is the corresponding reverse comparison key
    B_idx = blast_result_keys.index(':'.join(reversed(key.split(':'))))
    B = blast_result_keys[B_idx]
    # If they have not already been checked, check them; otherwise move to the next pair
    if (checked[A_idx] == 0) and (checked[B_idx] == 0):
        # mark them as having been checked
        checked[A_idx] = 1
        checked[B_idx] = 1
        print(f'Checking {A} and {B}')
        # find the reciprocal best blast hits
        for A_hit in blast_result_dict[A]:
            # For each hit in A we try to lookup the reverse in B
            # if we get a ValueError, that means that there is no reciprocal best hit
            try:
                B_lookup = list(reversed(A_hit))
                B_hit = blast_result_dict[B].index(B_lookup)
                if A in reciprocal_hits:
                    reciprocal_hits[A].append(A_hit)
                else:
                    reciprocal_hits[A] = [A_hit]
            except ValueError:
                # No reciprocal best hit for this query; skip it
                pass
    A_idx += 1


Checking A_thaliana:B_napus and B_napus:A_thaliana
Checking A_thaliana:B_oleracea and B_oleracea:A_thaliana
Checking A_thaliana:B_rapa and B_rapa:A_thaliana
Checking A_thaliana:B_rapa_v1.5 and B_rapa_v1.5:A_thaliana
Checking B_napus:B_oleracea and B_oleracea:B_napus
Checking B_napus:B_rapa and B_rapa:B_napus
Checking B_napus:B_rapa_v1.5 and B_rapa_v1.5:B_napus
Checking B_oleracea:B_rapa and B_rapa:B_oleracea
Checking B_oleracea:B_rapa_v1.5 and B_rapa_v1.5:B_oleracea
Checking B_rapa:B_rapa_v1.5 and B_rapa_v1.5:B_rapa
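
The lookup above calls `list.index` once per hit, which scans the reverse result list each time. A possible alternative, sketched here on toy data, stores the reverse results in a set of tuples so each membership test is constant time on average. It also returns the non-reciprocal pairs, which the description says should go to a separate file but which the loop above does not collect. The function name `split_reciprocal` and the gene names are made up for illustration.

```python
def split_reciprocal(a_hits, b_hits):
    """Partition a_hits into reciprocal and non-reciprocal [query, hit] pairs."""
    # Store B's (query, hit) pairs as tuples so membership tests are fast
    b_pairs = {tuple(pair) for pair in b_hits}
    reciprocal, non_reciprocal = [], []
    for query, hit in a_hits:
        # A's (query, hit) is reciprocal if B reports (hit, query)
        if (hit, query) in b_pairs:
            reciprocal.append([query, hit])
        else:
            non_reciprocal.append([query, hit])
    return reciprocal, non_reciprocal

# Toy data with made-up gene names
a_hits = [['a1', 'b1'], ['a2', 'b2'], ['a3', 'NA']]
b_hits = [['b1', 'a1'], ['b2', 'a9']]
reciprocal, non_reciprocal = split_reciprocal(a_hits, b_hits)
print(reciprocal)
print(non_reciprocal)
```

In the workbook this would be called as `split_reciprocal(blast_result_dict[A], blast_result_dict[B])` for each checked pair.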


In [12]:
reciprocal_hits.keys()

dict_keys(['A_thaliana:B_napus', 'A_thaliana:B_oleracea', 'A_thaliana:B_rapa', 'A_thaliana:B_rapa_v1.5', 'B_napus:B_oleracea', 'B_napus:B_rapa', 'B_napus:B_rapa_v1.5', 'B_oleracea:B_rapa', 'B_oleracea:B_rapa_v1.5', 'B_rapa:B_rapa_v1.5'])

In [13]:
# Write out results to file
for k in reciprocal_hits:
    outfile = k + '.brbh.tsv'
    with open(outfile, 'w') as ofh:
        for entry in reciprocal_hits[k]:
            ofh.write('\t'.join(entry))
            ofh.write('\n')
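
One portability note on the output names: the keys contain `:`, which is not allowed in file names on Windows. A minimal sketch of a writer that avoids this is shown below. The helper `write_pairs` and the `_vs_` replacement are assumptions for illustration, not what the cell above does.

```python
import csv
import os
import tempfile

def write_pairs(key, pairs, outdir='.', suffix='.brbh.tsv'):
    """Write [query, hit] pairs as TSV and return the output path (hypothetical helper)."""
    # Assumed convention: replace ':' so the name is also valid on Windows
    outfile = os.path.join(outdir, key.replace(':', '_vs_') + suffix)
    with open(outfile, 'w', newline='') as ofh:
        writer = csv.writer(ofh, delimiter='\t', lineterminator='\n')
        writer.writerows(pairs)
    return outfile

# Write toy data to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmpdir:
    path = write_pairs('A_thaliana:B_napus', [['a1', 'b1'], ['a2', 'b2']], outdir=tmpdir)
    with open(path) as fh:
        written = fh.read()
print(os.path.basename(path))
```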

In [None]:
# for k in blast_result_dict:
#     outfile = k + '.blastp.tsv'
#     with open(outfile, 'w') as ofh:
#         for entry in blast_result_dict[k]:
#             ofh.write('\t'.join(entry))
#             ofh.write('\n')