I restructure a bit of the code here and also some of the data. The researchers have not been consistent in the enumeration of the peptides neither the domains so its kinda difficult to structure it all together. 

In [1]:
import os 
os.chdir('E:\Ecole\Year 3\Projet 3A')
import pandas as pd
import numpy as np 

class Domain:
    
    def __init__(self, name):
        self.name = name
        self.thresholds = None
        self.thetas = None

class Peptide:
    
    def __init__(self, name):
        self.name = name
        self.sequence = None
        self.sequence_bis = None ##Sequence bis are the last five amino acids
        self.energy_ground = 0.0 ##Anticipating the calculation of a ground state energy for the peptide
        
class Data:
    
    def __init__(self):
        temp_df = pd.read_excel('Data_PDZ/MDSM_01_stiffler_bis.xls')
        self.aminoacids = [acid.encode('utf-8') for acid in list(temp_df.columns[:20])]
        self.df = temp_df.T
        self.domains = [Domain(domain.encode('utf-8')) for domain in list(self.df.columns)]
        self.domain_names = [domain.name for domain in self.domains]
        self.pep_seqs = []
        self.pep_names = []
        with open('Data_PDZ/peptides.free') as f:
            for line in f:
                x = line.split()
                self.pep_seqs.append(x[1])
                self.pep_names.append(x[0])
        self.peptides = [Peptide(name) for name in self.pep_names]
        
    def create_domains(self):
        for domain in self.domains:
            domain.thetas = self.df[domain.name][:100]
            domain.thetas = np.asarray(domain.thetas)
            domain.thetas = domain.thetas.reshape(5,20)
            domain.thresholds = np.asarray(self.df[domain.name][100:])   
    
    def create_peptides(self):
        for i in range(len(self.pep_seqs)):
            self.peptides[i].sequence = self.pep_seqs[i]
            self.peptides[i].sequence_bis = list(self.pep_seqs[i])[5:]        

In [2]:
PDZ_Data = Data()

In [3]:
PDZ_Data.create_domains()
PDZ_Data.create_peptides()

In [4]:
PDZ_Data.peptides[10].sequence_bis

['D', 'D', 'L', 'E', 'I']

Now we have created the preliminary data with the binding energy values and the peptide sequences. The last thing left to do is to get the data from the interaction matrix for each of the domain

In [5]:
fp_interaction_matrix = pd.read_excel('Data_PDZ/fp_interaction_matrix.xlsx')
for column in fp_interaction_matrix.columns:
    fp_interaction_matrix.loc[fp_interaction_matrix[column] == 0.0, column] = -1.0
fp_interaction_matrix = fp_interaction_matrix.rename(columns=lambda x: str(x).replace(" ", ""))

In [6]:
def evaluate_score(domain, peptide):
    score = 0.0
    for i in range(5):
        j = PDZ_Data.aminoacids.index(peptide.sequence_bis[i])
        score += domain.thetas[i,j]
    return score - domain.thresholds[0]
    

In [7]:
evaluate_score(PDZ_Data.domains[16], PDZ_Data.peptides[9])

10.72625

In [8]:
def sigmoid(x, a=1):
    return 1.0/(1+np.exp(-1.0*a*x))
def log_modified(x):
    if x > 0:
        return np.log(1+np.exp(-x))
    else:
        return -x + np.log(1+np.exp(x))

Let us take one particular ligand and make mutations to this ligand. 

In [9]:
test_peptide = PDZ_Data.peptides[3]
print test_peptide.name

ASIC2


In [10]:
print test_peptide.sequence_bis

['E', 'E', 'I', 'A', 'C']


Let us calculate the **energy** associated for each peptide in our data set. Once calculated for one peptide we shall calculate it for all the peptides in our data set. These values would then also be considered as fixed for the purposes of modeling the robustness of the specificity of the peptide-domain interaction. 

In [11]:
score_natural = 0.0
print test_peptide.name
for i in range(len(PDZ_Data.domain_names)):
    temp = evaluate_score(PDZ_Data.domains[i], test_peptide)
    alpha = fp_interaction_matrix[test_peptide.name][i]
    ## As a sanity check we print the values of alpha as well
    ## We remark that ASIC2 doesnt bind to any of the PDZ Domains that we consider and thus all values should be -1
    #print alpha
    if alpha > 0:
        alpha = +1.0
    score = temp*alpha
    temp2 = log_modified(score)
    score_natural += temp2 
print score_natural

ASIC2
2.58576358448


Now that we have calculated the energies for one peptide, let us calculate the ground state energies for all the peptides in the system. We shall write a simple function which does this given a peptide

In [12]:
def evaluate_energy(peptide):
    score_natural = 0.0
    for i in range(len(PDZ_Data.domain_names)): 
        temp = evaluate_score(PDZ_Data.domains[i], peptide)
        alpha = fp_interaction_matrix[peptide.name][i]
        if alpha > 0:
            alpha = +1.0
        score = temp*alpha
        temp2 = log_modified(score)
        score_natural += temp2 
    return score_natural

In [13]:
for pep in PDZ_Data.peptides:
    pep.energy_ground = evaluate_energy(pep)

In [14]:
#for pep in PDZ_Data.peptides:
   # print pep.name, pep.energy_ground


## Simulation 
Now we shall start with the real Monte Carlo step. Our algorithm is based on the famous Metropolis algorithm. We start with a given peptide and its sequence. To each peptide is associated a particular energy(which we calculated above). We expect that under mutations of the sequence, this energy will change. Depending on whether the energy changes or not after a point mutation, we shall accept or reject the mutation. 

Let us first start off by writing some convenience functions to make point mutations

**UPDATE**:

After some problems with the data, mainly to do with the way the data was indexed by the researchers, we are finally in a position to retrieve something meaningful out of the data. 

For an initial run, the acceptance rule for the mutations was rather simple: If the energy reduced we would accept the mutation, otherwise not. Surprisingly, with such a rule, no new mutations which conserve the original binding pairs, are allowed. Further exploration is needed at this stage to completely deconstruct this finding.

To facilitate further analysis, I compute the energies and the scores for each of the peptide-PDZ pairs. These energies will serve as a reference for further mutations. 

In [15]:
def convert2seq(seq_int):
    return [PDZ_Data.aminoacids[i] for i in seq_int]
def convert2int(seq_pep):
    return [PDZ_Data.aminoacids.index(pep) for pep in seq_pep]

Lets take a peptide like one from the Claudin family. The advantage with the claudin family is that they bind more than one of the given PDZ domains. Let us take Claudin14 

In [16]:
index =  PDZ_Data.pep_names.index('AcvR1')

In [17]:
test_peptide = PDZ_Data.peptides[index]

In [18]:
print test_peptide.name
print test_peptide.sequence_bis
print test_peptide.energy_ground

AcvR1
['L', 'K', 'T', 'D', 'C']
8.1782910361


In [19]:
base_seq = convert2int(test_peptide.sequence_bis)

In [20]:
print base_seq
print PDZ_Data.aminoacids

[3, 15, 10, 18, 14]
['G', 'A', 'V', 'L', 'I', 'M', 'P', 'F', 'W', 'S', 'T', 'N', 'Q', 'Y', 'C', 'K', 'R', 'H', 'D', 'E']


To make a mutation we need two numbers, one a number between 0 and 4 which will tell us the position to be mutated and a number between 0 and 19 which will tell us the amino acid to put in that position. We can do this easily by making two calls to the randomint function in numpy. 

In [21]:
y = np.random.randint(5)
z = np.random.randint(20)
print y, z

3 7


In [22]:
mut_seq = base_seq
mut_seq[y] = z

In [23]:
print mut_seq
print convert2seq(mut_seq)

[3, 15, 10, 7, 14]
['L', 'K', 'T', 'F', 'C']


In [24]:
def eval_score(domain, sequence):
    score = 0.0
    for i in range(5):
        score += domain.thetas[i,sequence[i]]
    return score - domain.thresholds[0]
print evaluate_score(PDZ_Data.domains[15], PDZ_Data.peptides[index])
## Sanity Check
temp = PDZ_Data.peptides[index]
print eval_score(PDZ_Data.domains[15], convert2int(temp.sequence_bis))
print PDZ_Data.domain_names[15]

-6.74577
-6.74577
Harmonin (2/3)


In [25]:
def eval_energy(peptide, sequence):
    score_natural = 0.0
    for i in range(len(PDZ_Data.domain_names)): 
        temp = eval_score(PDZ_Data.domains[i], sequence)
        alpha = fp_interaction_matrix[peptide.name][i]
        #print PDZ_Data.domain_names[i], alpha
        ##Save the individual energies 
        if alpha > 0:
            alpha = +1.0
        score = temp*alpha
        temp2 = log_modified(score)
        score_natural += temp2 
    return score_natural
##Sanity Check
print eval_energy(test_peptide, convert2int(test_peptide.sequence_bis))
print test_peptide.energy_ground

8.1782910361
8.1782910361


In [26]:
print eval_energy(test_peptide, mut_seq)

48.7835771261


This is a basic run where we introduce mutations and see whether the energy actually reduces or not. The base energy is 8.178291

In [27]:
Nb_runs = 25
mutated_sequences = []
mutated_energies = []
print "Initial Configuration"
print convert2int(test_peptide.sequence_bis), test_peptide.sequence_bis, test_peptide.energy_ground
print "Results after simulations"
for i in range(Nb_runs+1):
    y = np.random.randint(5)
    z = np.random.randint(20)
    mut_seq[y] = z
    mutated_sequences.append(list(mut_seq))
    energy = eval_energy(test_peptide, mut_seq)
    if energy < test_peptide.energy_ground:
        print mut_seq, convert2seq(mut_seq), energy
    mutated_energies.append(eval_energy(test_peptide, mut_seq))

Initial Configuration
[3, 15, 10, 18, 14] ['L', 'K', 'T', 'D', 'C'] 8.1782910361
Results after simulations
[2, 15, 10, 2, 14] ['V', 'K', 'T', 'V', 'C'] 6.09547013008
[2, 6, 10, 2, 14] ['V', 'P', 'T', 'V', 'C'] 8.1257352667
[2, 6, 10, 2, 16] ['V', 'P', 'T', 'V', 'R'] 8.06161878859
[1, 6, 10, 5, 16] ['A', 'P', 'T', 'M', 'R'] 7.94801221753
[1, 6, 19, 5, 16] ['A', 'P', 'E', 'M', 'R'] 7.48035034333
[1, 6, 19, 5, 16] ['A', 'P', 'E', 'M', 'R'] 7.48035034333
[6, 6, 19, 5, 16] ['P', 'P', 'E', 'M', 'R'] 3.64712837068
[6, 13, 19, 5, 16] ['P', 'Y', 'E', 'M', 'R'] 6.57533606018
[4, 13, 19, 4, 9] ['I', 'Y', 'E', 'I', 'S'] 7.68437305514


We now run the Metropolis Hastings algorithm for our data set. If the energy decreases then it is a favorable mutation and we accept the mutation, otherwise we reject the mutation

In [28]:
sims = []
Nb_runs = 1000
base_seq = convert2int(test_peptide.sequence_bis)
for j in range(10):
    print "Run Number: {}".format(j+1)
    sim_results = []
    mutated_sequences_bis = []
    mutated_energies_bis = []
    mut_seq = base_seq
    mut_energy = test_peptide.energy_ground
    for i in range(Nb_runs):
        y = np.random.randint(5)
        z = np.random.randint(20)
        temp_seq = mut_seq
        temp_seq[y] = z
        temp_energy = eval_energy(test_peptide, temp_seq)
        if temp_energy < mut_energy:
            mut_energy = temp_energy
            mut_seq = temp_seq
            sim_results.append({'Sequence': temp_seq, 'Energy': temp_energy, 'Accepted': 1})
            print "Accepted {} {} {} ".format(temp_seq, temp_energy, convert2seq(temp_seq))
        else:
            sim_results.append({'Sequence': temp_seq, 'Energy': temp_energy, 'Accepted': 0})
            ##print "Rejected {} {}".format(temp_seq, temp_energy)
        mutated_sequences_bis.append(temp_seq)
        mutated_energies_bis.append(temp_energy)
    sims.append(sim_results)
print "Ground State config {} {}".format(test_peptide.energy_ground, test_peptide.sequence_bis)

Run Number: 1
Accepted [3, 15, 16, 18, 14] 2.95285700208 ['L', 'K', 'R', 'D', 'C'] 
Accepted [11, 19, 9, 6, 15] 2.14258294569 ['N', 'E', 'S', 'P', 'K'] 
Accepted [6, 11, 16, 18, 0] 1.42033234874 ['P', 'N', 'R', 'D', 'G'] 
Accepted [6, 11, 12, 18, 0] 1.18595950809 ['P', 'N', 'Q', 'D', 'G'] 
Accepted [17, 19, 2, 11, 8] 1.03898661624 ['H', 'E', 'V', 'N', 'W'] 
Accepted [16, 18, 2, 18, 8] 1.00814188279 ['R', 'D', 'V', 'D', 'W'] 
Run Number: 2
Accepted [19, 1, 4, 14, 12] 4.78189533589 ['E', 'A', 'I', 'C', 'Q'] 
Accepted [14, 1, 4, 14, 12] 4.17985474375 ['C', 'A', 'I', 'C', 'Q'] 
Accepted [8, 12, 16, 10, 14] 2.97236779055 ['W', 'Q', 'R', 'T', 'C'] 
Accepted [3, 14, 6, 1, 17] 2.51471739741 ['L', 'C', 'P', 'A', 'H'] 
Accepted [0, 0, 6, 1, 17] 2.14371512431 ['G', 'G', 'P', 'A', 'H'] 
Accepted [4, 17, 19, 1, 1] 1.81470565807 ['I', 'H', 'E', 'A', 'A'] 
Accepted [19, 19, 1, 11, 13] 1.76987818715 ['E', 'E', 'A', 'N', 'Y'] 
Accepted [16, 18, 5, 0, 9] 1.72707573236 ['R', 'D', 'M', 'G', 'S'] 
Accepted

Here we see that there is no mutation that acts in a manner to reduce the energy corresponding to the peptide. Of course, a useful question to ask is whether for individual PDZ's whether these mutations increase the score or not. 

In [29]:
print np.min(np.asarray(mutated_energies_bis))
print np.argmin(np.asarray(mutated_energies_bis))
print convert2seq(mutated_sequences_bis[np.argmin(np.asarray(mutated_energies_bis))])

1.22418536355
607
['Y', 'H', 'V', 'I', 'N']


In [139]:
def eval_energy_2(peptide, sequence):
    score_natural = 0.0
    energies = []
    for i in range(len(PDZ_Data.domain_names)): 
        temp = eval_score(PDZ_Data.domains[i], sequence)
        alpha = fp_interaction_matrix[peptide.name][i]
        if alpha > 0:
            alpha = +1.0
        score = temp*alpha
        temp2 = log_modified(score)
        energies.append({'Energy': temp2, 'alpha': alpha, 'score': temp})
        score_natural += temp2 
    return energies

In [140]:
x= eval_energy_2(test_peptide, convert2int(test_peptide.sequence_bis))

In [155]:
neg = []
pos = [] 
for i in range(len(x)):
    if x[i]['alpha'] == -1.0:
        neg.append({'data':x[i], 'name': PDZ_Data.domain_names[i], 'index' : i+2})
    else:
        pos.append({'data':x[i], 'name': PDZ_Data.domain_names[i], 'index' : i+2})
for i in range(len(neg)):
    print neg[i]


{'index': 2, 'data': {'alpha': -1.0, 'Energy': 0.00033047741288672462, 'score': -8.0148070000000011}, 'name': 'Cipp (03/10)'}
{'index': 3, 'data': {'alpha': -1.0, 'Energy': 4.4945943813552991e-07, 'score': -14.615220000000001}, 'name': 'Cipp (05/10)'}
{'index': 4, 'data': {'alpha': -1.0, 'Energy': 0.36868588401169605, 'score': -0.80781000000000036}, 'name': 'Cipp (08/10)'}
{'index': 5, 'data': {'alpha': -1.0, 'Energy': 2.7643116239509001e-05, 'score': -10.496119999999999}, 'name': 'Cipp (09/10)'}
{'index': 6, 'data': {'alpha': -1.0, 'Energy': 1.6221456479634161e-06, 'score': -13.331760000000001}, 'name': 'Cipp (10/10)'}
{'index': 7, 'data': {'alpha': -1.0, 'Energy': 5.1771544900187074e-05, 'score': -9.8686439999999997}, 'name': 'D930005D10Rik (1/1)'}
{'index': 8, 'data': {'alpha': -1.0, 'Energy': 0.00010797595565362467, 'score': -9.1335480000000011}, 'name': 'Dlgh3 (1/1)'}
{'index': 9, 'data': {'alpha': -1.0, 'Energy': 3.3629880402268024e-06, 'score': -12.602679}, 'name': 'Dvl1 (1/1)'}

In [163]:
for i in range(len(pos)):
    print pos[i]

Here we have seen that none of the mutations generated result in the lowering of the energy suggesting that the specificity has already been optimized. We could run a variant of the algorithm whereby we accept energetic mutations with a certain probability. 