The goal of this code is to guess the catalytic residues of a family 1 glycoside hydrolase and generate a `params.json` file for docking and deep mutational scanning. 

This is the format of `params.json`, which holds all the information required to go from a comparative model to a docked and scanned array of models:

```json
{
    "handle": "2jie",
    "nucleophile": 353,
    "acid_base": 164,
    "backup": 295,
    "ligand": 446
}
```
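
Writing this dictionary to disk only needs the standard library. Here is a minimal sketch, assuming the downstream docking code reads the keys exactly as shown above. The `write_params` helper name is hypothetical, not something from this repo.

```python
import json

def write_params(params, path='params.json'):
    """Serialize the catalytic-residue parameters as JSON (hypothetical helper)."""
    with open(path, 'w') as handle:
        json.dump(params, handle, indent=4)
        handle.write('\n')

params = {
    'handle': '2jie',
    'nucleophile': 353,
    'acid_base': 164,
    'backup': 295,
    'ligand': 446,
}
write_params(params)

# Round-trip check: what we read back should equal what we wrote
with open('params.json') as handle:
    assert json.load(handle) == params
```

Note that `json.dump` emits double-quoted keys and no trailing comma, so the file on disk is valid JSON even though the Python dict literal is not.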

### Ideas for how to do this 

I think one way would be to take a Bayesian approach. For example, we can place a Gaussian prior over sequence position for each of the catalytic residues we are trying to guess.

![](priors.png)

Then, we can use the information from a multiple sequence alignment to improve our guesses until we can make a sensible one. 
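
To make the idea concrete, here is a rough sketch of one prior and one update step. Everything in it is an assumption for illustration: the `sigma` width, the `mismatch` penalty, and the crude residue-identity likelihood that stands in for real alignment information. The toy sequence is made up.

```python
import math

def gaussian_prior(length, center, sigma=20.0):
    """Normalized Gaussian weights over 1-based positions 1..length."""
    weights = [math.exp(-0.5 * ((pos - center) / sigma) ** 2)
               for pos in range(1, length + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def update(prior, sequence, expected, mismatch=0.01):
    """Multiply the prior by a crude likelihood and renormalize.

    The likelihood is 1 where the residue equals `expected` and
    `mismatch` everywhere else (an assumed, hand-picked penalty).
    """
    posterior = [p * (1.0 if aa == expected else mismatch)
                 for p, aa in zip(prior, sequence)]
    total = sum(posterior)
    return [p / total for p in posterior]

# Toy example: look for a glutamate (E) near position 8
sequence = 'AAEAAAAAEAAA'
prior = gaussian_prior(len(sequence), center=8, sigma=3.0)
posterior = update(prior, sequence, expected='E')

# Most probable 1-based position after the update
best = max(range(len(posterior)), key=posterior.__getitem__) + 1
print(best)  # 9
```

The prior alone peaks at position 8, which is an alanine here; the update moves the guess to the nearest glutamate at position 9 rather than the more distant one at position 3.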

One easy way to do this would be to use a sequence whose catalytic residues you know. If a residue aligns to one of those, it is a pretty good guess. 

In [1]:
known_sequence = '''
>BglB
NTFIFPATFMWGTSTSSYQIEGGTDEGGRTPSIWDTFCQIPGKVIGGDCGDVACDHFHHF
KEDVQLMKQLGFLHYRFSVAWPRIMPAAGIINEEGLLFYEHLLDEIELAGLIPMLTLYHW
DLPQWIEDEGGWTQRETIQHFKTYASVIMDRFGERINWWNTINEPYCASILGYGTGEHAP
GHENWREAFTAAHHILMCHGIASNLHKEKGLTGKIGITLNMEHVDAASERPEDVAAAIRR
DGFINRWFAEPLFNGKYPEDMVEWYGTYLNGLDFVQPGDMELIQQPGDFLGINYYTRSII
RSTNDASLLQVEQVHMEEPVTDMGWEIHPESFYKLLTRIEKDFSKGLPILITENGAAMRD
ELVNGQIEDTGRHGYIEEHLKACHRFIEEGGQLKGYFVWSFLDNFEWAWGYSKRFGIVHI
NYETQERTPKQSALWFKQMMAKNGFGSLE
'''

params = {
    'handle': '2jie',
    'nucleophile': 353,
    'acid_base': 164,
    'backup': 295,
    'ligand': 446,
}

with open('known.fa', 'w') as fn:
    fn.write(known_sequence)

In [2]:
!blastp -subject known.fa -query example_input_files/target.fasta -outfmt "6 sseq qseq sstart qstart" > blast.out 

In [3]:
! cat blast.out

FMWGTSTSSYQIEGGTDEGGR---TPSIWDTFCQIP----GKVIGGDCGDVACDHFHHFKEDVQLMKQLGFLHYRFSVAWPRIMPAAG------------------------------IINEEGLLFYEHLLDEIELAGLIPMLTLYHWDLPQWIED------------EGGWTQRETIQHFKTYASVIMDRFGERINWWNTINEPYCASILGYGTGEHA--PGHENWREAFTAAHHILMCHGIASNLHKEKGLTGK-IGITLNM-EHVDAASERPEDVAAAIRRDGFINRWFAEPLFNGKYPEDMVEWYGTYLNGLDFVQPGDMELIQQPGDFLGINYYTRSIIRSTNDASLLQVEQVHMEE----------------PVTDMGWEIHPESFYKLLTRIEKDFSKGLPILITENGAAMRDELVNGQIEDTGRHGYIEEHLKACHRFIEEGGQLKGYFVWSFLDNFEWAWGYSKRFGIVHINYETQERTPKQSALWFKQMMAKNGF	FAWGVVQSAFQFEMG-DPLRRFIDTRTDWWHWVRDPLNIKNDLVSGHLPEDGINNYGLYEIDHQLAKDMGLNAYQITVEWSRIFPCPTYGVEVDFERDSYGLIKRVKITKETLHELEEIANAKEVEHYREVLKNLKELGFSTFVTLNHQTQPIWLHDPIHVRENFEKARAKGWVDERAILEFAKFAAFVAWKRWDLVDYWATFDEPMVTVELGYLAPYVGWPPGILNPKAAKAVIINQLVGHARA--YEAVKTFSDKPVGIILNIIPAYPRDPNDPKDVKATENYDLFHNRIFLDGVNEGKVDLDFD---GNYVK-IDHLKRND---------WIGNNYYTREVISTRNPNTRSSDNKLRGDEGYGYSSEPNSVSKDNNPTSDFGWECFPQGMYDSIM-IGNEYRK--PIYITENGIADSRDLL--------RPRYIKEHVEKMFEAIQAGADVRGYFHWALTDNYEWAMGFKIKFGLYEVDPISKQRIPRPR

In [14]:
with open('blast.out') as fn:
    sseq, qseq, sstart, qstart = fn.readline().split()
# BLAST reports 1-based start positions as text; convert both to integers
sstart, qstart = int(sstart), int(qstart)

In [15]:
sseq

'FMWGTSTSSYQIEGGTDEGGR---TPSIWDTFCQIP----GKVIGGDCGDVACDHFHHFKEDVQLMKQLGFLHYRFSVAWPRIMPAAG------------------------------IINEEGLLFYEHLLDEIELAGLIPMLTLYHWDLPQWIED------------EGGWTQRETIQHFKTYASVIMDRFGERINWWNTINEPYCASILGYGTGEHA--PGHENWREAFTAAHHILMCHGIASNLHKEKGLTGK-IGITLNM-EHVDAASERPEDVAAAIRRDGFINRWFAEPLFNGKYPEDMVEWYGTYLNGLDFVQPGDMELIQQPGDFLGINYYTRSIIRSTNDASLLQVEQVHMEE----------------PVTDMGWEIHPESFYKLLTRIEKDFSKGLPILITENGAAMRDELVNGQIEDTGRHGYIEEHLKACHRFIEEGGQLKGYFVWSFLDNFEWAWGYSKRFGIVHINYETQERTPKQSALWFKQMMAKNGF'

In [16]:
print('Our catalytic residue positions will be shifted by {}'.format(sstart))

Our catalytic residue positions will be shifted by 9


In [27]:
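# Catalytic positions in the known (subject) sequence
catalytic = {params[k] for k in ('nucleophile', 'acid_base', 'backup')}

sstart, qstart = int(sstart), int(qstart)

mmap = {}
scount = qcount = 0
for i, j in zip(sseq, qseq):

    # Advance each counter only on a residue, never on an alignment gap
    if i != '-':
        scount += 1
    if j != '-':
        qcount += 1

    # A gap in the subject has no subject position to look up
    if i == '-':
        continue

    # 1-based positions in the full subject and query sequences
    subject_index = scount + sstart - 1
    query_index = qcount + qstart - 1

    if subject_index in catalytic:
        if i == j:
            print('Mapping:', i, j, subject_index, query_index)
            mmap[subject_index] = query_index
        else:
            print("Aligned residues aren't the same residue")

mmap

Mapping: E E 164 207
Mapping: Y Y 295 327
Mapping: E E 353 398


{164: 207, 295: 327, 353: 398}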
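
With a mapping like this in hand, the last step is to translate the known catalytic positions into a parameter set for the target. Below is a small sketch of how that could look. The `remap_params` helper is hypothetical, and the values are toy numbers, not the ones from the alignment above. I leave out `ligand` because the alignment says nothing about ligand numbering in the target model.

```python
def remap_params(params, mapping, handle,
                 keys=('nucleophile', 'acid_base', 'backup')):
    """Translate catalytic positions through `mapping` (hypothetical helper).

    Raises KeyError if any catalytic residue failed to map, so an
    incomplete alignment is caught before docking.
    """
    target = {'handle': handle}
    for key in keys:
        target[key] = mapping[params[key]]
    return target

# Toy values for illustration only
toy_params = {'handle': 'known', 'nucleophile': 30,
              'acid_base': 10, 'backup': 20, 'ligand': 99}
toy_map = {10: 12, 20: 25, 30: 41}

target_params = remap_params(toy_params, toy_map, handle='target')
print(target_params)
# {'handle': 'target', 'nucleophile': 41, 'acid_base': 12, 'backup': 25}
```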