I restructure a bit of the code here and also some of the data. The researchers have not been consistent in the enumeration of the peptides neither the domains so its kinda difficult to structure it all together. 

**UPDATE**:

There was lots of redundant code that I had written for demonstration purposes. The code written henceforth is cleaner and more modular(I hope) 

In [1]:
import os 
os.chdir('E:\Ecole\Year 3\Projet 3A')
import pandas as pd
import numpy as np 

class Domain:
    
    def __init__(self, name):
        self.name = name
        self.thresholds = None
        self.thetas = None

class Peptide:
    
    def __init__(self, name):
        self.name = name
        self.sequence = None
        self.sequence_bis = None ##Sequence bis are the last five amino acids
        self.energy_ground = 0.0 ##Anticipating the calculation of a ground state energy for the peptide
        
class Data:
    
    def __init__(self):
        temp_df = pd.read_excel('Data_PDZ/MDSM_01_stiffler_bis.xls')
        self.aminoacids = [acid.encode('utf-8') for acid in list(temp_df.columns[:20])]
        self.df = temp_df.T
        self.domains = [Domain(domain.encode('utf-8')) for domain in list(self.df.columns)]
        self.domain_names = [domain.name for domain in self.domains]
        self.pep_seqs = []
        self.pep_names = []
        with open('Data_PDZ/peptides.free') as f:
            for line in f:
                x = line.split()
                self.pep_seqs.append(x[1])
                self.pep_names.append(x[0])
        self.peptides = [Peptide(name) for name in self.pep_names]
        
    def create_domains(self):
        for domain in self.domains:
            domain.thetas = self.df[domain.name][:100]
            domain.thetas = np.asarray(domain.thetas)
            domain.thetas = domain.thetas.reshape(5,20)
            domain.thresholds = np.asarray(self.df[domain.name][100:])   
    
    def create_peptides(self):
        for i in range(len(self.pep_seqs)):
            self.peptides[i].sequence = self.pep_seqs[i]
            self.peptides[i].sequence_bis = list(self.pep_seqs[i])[5:]        

In [2]:
PDZ_Data = Data()
PDZ_Data.create_domains()
PDZ_Data.create_peptides()

Now we have created the preliminary data with the binding energy values and the peptide sequences. The last thing left to do is to get the data from the interaction matrix for each of the domain. 

We also write some convenience functions like the log and sigmoid functions 

In [3]:
fp_interaction_matrix = pd.read_excel('Data_PDZ/fp_interaction_matrix.xlsx')
for column in fp_interaction_matrix.columns:
    fp_interaction_matrix.loc[fp_interaction_matrix[column] == 0.0, column] = -1.0
fp_interaction_matrix = fp_interaction_matrix.rename(columns=lambda x: str(x).replace(" ", ""))

In [4]:
## Sigmoid Function
def sigmoid(x, a=1):
    return 1.0/(1+np.exp(-1.0*a*x))
## Log(1+exp(-x)) 
## We take care of numerical stability for values of x < 0
def log_modified(x):
    if x > 0:
        return np.log(1+np.exp(-x))
    else:
        return -x + np.log(1+np.exp(x))
    
## Convenience functions to convert between letter sequences and indexed sequences. 
## The index for each amino acid is computed using the enumeration presented in the file "MDSM_01_stiffler_bis.xls" 
def convert2seq(seq_int):
    return [PDZ_Data.aminoacids[i] for i in seq_int]
def convert2int(seq_pep):
    return [PDZ_Data.aminoacids.index(pep) for pep in seq_pep]

In [5]:
## Unlike the previous code, where we had two different versions of evaluate_score function, here we have just one. 
## To enter the sequence of a known peptide use convert2int(peptide name) 
def eval_score(domain, sequence):
    score = 0.0
    for i in range(5):
        score += domain.thetas[i,sequence[i]]
    return score - domain.thresholds[0]

In [6]:
## Similarly for the energy evaluation we simply use the single version which use the sequence directly 
## The sequence argument will be useful when making mutations starting from the basic peptide sequence 
def eval_energy(peptide, sequence, verbose=0):
    score_natural = 0.0
    energies = []
    for i in range(len(PDZ_Data.domain_names)): 
        temp = eval_score(PDZ_Data.domains[i], sequence)
        alpha = fp_interaction_matrix[peptide.name][i]
        if alpha > 0:
            alpha = +1.0
        score = temp*alpha
        temp2 = log_modified(score)
        energies.append({'Energy': temp2, 'alpha': alpha, 'score': temp})
        score_natural += temp2 
    if verbose == 0:
        return score_natural
    else:
        return score_natural, energies

Let us take a particular example and illustrate the use of these basic functions on them. We shall use the protein *Caspr2*

In [7]:
ix = PDZ_Data.pep_names.index('Caspr2')
print ix
pep_demo = PDZ_Data.peptides[ix]
print pep_demo.name
print pep_demo.sequence
print pep_demo.sequence_bis

4
Caspr2
IDESKKEWLI
['K', 'E', 'W', 'L', 'I']


In [8]:
print eval_energy(pep_demo, convert2int(pep_demo.sequence_bis))
score, energies = eval_energy(pep_demo, convert2int(pep_demo.sequence_bis), verbose=1)

10.2327578819


In [9]:
neg = []
pos = [] 
for i in range(len(energies)):
    if energies[i]['alpha'] == -1.0:
        neg.append({'data':energies[i], 'name': PDZ_Data.domain_names[i], 'index' : i+2})
    else:
        pos.append({'data':energies[i], 'name': PDZ_Data.domain_names[i], 'index' : i+2})
for item in pos:
    print item

{'index': 2, 'data': {'alpha': 1.0, 'Energy': 0.58841515662593658, 'score': 0.22172999999999998}, 'name': 'Cipp (03/10)'}
{'index': 8, 'data': {'alpha': 1.0, 'Energy': 0.69316718075994543, 'score': -4.0000000000262048e-05}, 'name': 'Dlgh3 (1/1)'}
{'index': 19, 'data': {'alpha': 1.0, 'Energy': 2.5693183193875501e-05, 'score': 10.569272000000002}, 'name': 'HtrA3 (1/1)'}
{'index': 33, 'data': {'alpha': 1.0, 'Energy': 0.6931021815724453, 'score': 9.0000000000145519e-05}, 'name': 'Mpp7 (1/1)'}
{'index': 43, 'data': {'alpha': 1.0, 'Energy': 0.69316718075994543, 'score': -4.0000000000262048e-05}, 'name': 'PAR-3 (3/3)'}


In [10]:
## We calculate the natural energies for each of the peptides 
for pep in PDZ_Data.peptides:
    pep.energy_ground = eval_energy(pep, convert2int(pep.sequence_bis))

In [11]:
## Sanity Check 
print PDZ_Data.peptides[ix].energy_ground

10.2327578819



## Simulation 
Now we shall start with the real Monte Carlo step. Our algorithm is based on the famous Metropolis algorithm. We start with a given peptide and its sequence. To each peptide is associated a particular energy(which we calculated above). We expect that under mutations of the sequence, this energy will change. Depending on whether the energy changes or not after a point mutation, we shall accept or reject the mutation. 

Let us first start off by writing some convenience functions to make point mutations

**UPDATE**:

After some problems with the data, mainly to do with the way the data was indexed by the researchers, we are finally in a position to retrieve something meaningful out of the data. 

For an initial run, the acceptance rule for the mutations was rather simple: If the energy reduced we would accept the mutation, otherwise not. Surprisingly, with such a rule, no new mutations which conserve the original binding pairs, are allowed. Further exploration is needed at this stage to completely deconstruct this finding.

To facilitate further analysis, I compute the energies and the scores for each of the peptide-PDZ pairs. These energies will serve as a reference for further mutations. 

To make a mutation we need two numbers, one a number between 0 and 4 which will tell us the position to be mutated and a number between 0 and 19 which will tell us the amino acid to put in that position. We can do this easily by making two calls to the randomint function in numpy. 

In [12]:
y = np.random.randint(5)
z = np.random.randint(20)
print y, z

0 2


This is a basic run where we introduce mutations and see whether the energy actually reduces or not. The base energy is 8.178291

We now run the Metropolis Hastings algorithm for our data set. If the energy decreases then it is a favorable mutation and we accept the mutation, otherwise we reject the mutation

In [15]:
test_peptide = PDZ_Data.peptides[57]
print test_peptide.name
print test_peptide.energy_ground

AcvR1
8.1782910361


In [16]:
print " Uniform {} Test {}".format(np.random.uniform(), 2)

 Uniform 0.38313776258 Test 2


In [31]:
def run_mc(nb_runs, peptide, temp=0, nb_cycles=10):
    sims = []
    base_seq = convert2int(peptide.sequence_bis)
    for j in range (nb_cycles):
        print "\n Cycle number : {}\n".format(j+1)
        sim_results = []
        mutated_sequences = []
        mutated_energies = []
        mut_seq = base_seq
        mut_energy = peptide.energy_ground
        for i in range(nb_runs):
            y = np.random.randint(5)
            z = np.random.randint(20)
            temp_seq = mut_seq
            temp_seq[y] = z
            temp_energy = eval_energy(peptide, temp_seq)
            if temp == 0:
                if temp_energy < mut_energy:
                    mut_energy = temp_energy
                    mut_seq = temp_seq
                    sim_results.append({'Sequence': temp_seq, 'Energy': temp_energy, 'Accepted': 1})
                    print "Accepted {} {} {} ".format(temp_seq, temp_energy, convert2seq(temp_seq))
                else:
                    sim_results.append({'Sequence': temp_seq, 'Energy': temp_energy, 'Accepted': 0})  
            else:
                ratio = np.exp(-temp*(temp_energy-mut_energy))
                prob_trans = min(1, ratio)
                x = np.random.uniform()
                
                if x < prob_trans:
                    mut_energy = temp_energy
                    mut_seq = temp_seq
                    print "Run number: {}\n".format(i)
                    print "Uniform {} Ratio {} Prob_Trans {} ".format(x,ratio,prob_trans)
                    sim_results.append({'Sequence': temp_seq, 'Energy': temp_energy, 'Accepted': 1})
                    print "Accepted {} {} {} \n".format(temp_seq, temp_energy, convert2seq(temp_seq))
                else:
                    sim_results.append({'Sequence': temp_seq, 'Energy': temp_energy, 'Accepted': 0})  
                
                ##print "Rejected {} {}".format(temp_seq, temp_energy)
            mutated_sequences.append(temp_seq)
            mutated_energies.append(temp_energy)
        sims.append({'Results' : sim_results, 'Mutated sequences': mutated_sequences, 'Mutated Energies': mutated_energies}) 
    return sims

Here we perform mutations for the given peptide and only accept those mutations which lead to a reduction in the energy. Here the inverse temperature $\beta$ is infinite (or $T=0$)

In [32]:
results = run_mc(100, test_peptide)


 Cycle number : 1

Accepted [11, 16, 10, 18, 14] 6.65560710528 ['N', 'R', 'T', 'D', 'C'] 
Accepted [16, 17, 6, 18, 12] 1.44898291458 ['R', 'H', 'P', 'D', 'Q'] 
Accepted [16, 17, 6, 18, 5] 1.02989141805 ['R', 'H', 'P', 'D', 'M'] 

 Cycle number : 2

Accepted [3, 19, 4, 1, 15] 4.31746186881 ['L', 'E', 'I', 'A', 'K'] 
Accepted [3, 19, 15, 1, 15] 3.60813399749 ['L', 'E', 'K', 'A', 'K'] 
Accepted [3, 10, 15, 1, 15] 3.44158450157 ['L', 'T', 'K', 'A', 'K'] 
Accepted [3, 1, 15, 1, 15] 2.48514668817 ['L', 'A', 'K', 'A', 'K'] 

 Cycle number : 3

Accepted [9, 1, 4, 5, 9] 6.26835589215 ['S', 'A', 'I', 'M', 'S'] 
Accepted [9, 1, 4, 5, 15] 6.18247360924 ['S', 'A', 'I', 'M', 'K'] 
Accepted [9, 1, 0, 5, 15] 3.83118907759 ['S', 'A', 'G', 'M', 'K'] 
Accepted [15, 1, 15, 12, 14] 2.80608570073 ['K', 'A', 'K', 'Q', 'C'] 
Accepted [8, 19, 12, 9, 15] 2.36774360313 ['W', 'E', 'Q', 'S', 'K'] 

 Cycle number : 4

Accepted [6, 19, 19, 15, 1] 2.49064152902 ['P', 'E', 'E', 'K', 'A'] 

 Cycle number : 5

Accepted

We shall now run the simulation with a rejection probability proportional to $e^{-\beta(E_{new} - E_{old})}$ where $\beta$ is the inverse temperature and $E_{new}$ and $E_{old}$ are the energies after and before the mutation. 

In [33]:
results_2 = run_mc(100, test_peptide, temp=1.0, nb_cycles=1)


 Cycle number : 1

Run number: 2

Uniform 0.166396099846 Ratio 39.242240388 Prob_Trans 1 
Accepted [15, 6, 10, 18, 14] 4.50853730854 ['K', 'P', 'T', 'D', 'C'] 

Run number: 27

Uniform 0.31489971446 Ratio 2.50758533259 Prob_Trans 1 
Accepted [11, 11, 5, 14, 15] 3.58921703732 ['N', 'N', 'M', 'C', 'K'] 

Run number: 29

Uniform 0.366720151128 Ratio 1.35675690995 Prob_Trans 1 
Accepted [1, 11, 5, 14, 15] 3.28411981035 ['A', 'N', 'M', 'C', 'K'] 

Run number: 77

Uniform 0.469839643936 Ratio 2.86214981816 Prob_Trans 1 
Accepted [9, 19, 3, 9, 13] 2.23254678322 ['S', 'E', 'L', 'S', 'Y'] 



Let us now consider a peptide which according to the FP Interaction Matrix binds to a lot many domains and see how its behavior is different from that of the peptide *AcvR1*. For the next demo, we shall consider *Cnksr2*

In [34]:
test_peptide2 = PDZ_Data.peptides[PDZ_Data.pep_names.index('Cnksr2')]
print test_peptide2.name
print test_peptide2.energy_ground
print test_peptide2.sequence_bis

Cnksr2
6.91618770844
['I', 'E', 'T', 'H', 'V']


In [41]:
results_3 = run_mc(1000, test_peptide2)


 Cycle number : 1


 Cycle number : 2


 Cycle number : 3


 Cycle number : 4


 Cycle number : 5


 Cycle number : 6


 Cycle number : 7


 Cycle number : 8


 Cycle number : 9


 Cycle number : 10



In [42]:
results_4 = run_mc(1000, test_peptide2, temp=1.0, nb_cycles=10)


 Cycle number : 1


 Cycle number : 2


 Cycle number : 3


 Cycle number : 4


 Cycle number : 5


 Cycle number : 6


 Cycle number : 7


 Cycle number : 8


 Cycle number : 9


 Cycle number : 10



### Observation

Quite interesting that in neither of the cases ($T=0$ or $T=1.0$), does there exist a mutation which would reduce the energy of the peptide. Can we conclude that a peptide such as *Cnksr2* already has a sequence already highly optimised. 

Here we see that the energy is close to 6. Let us consider a peptide which has a lower energy and see whether its sequence has been completely optimized or not. 

In [43]:
test_peptide3 = PDZ_Data.peptides[7]
print test_peptide3.name
print test_peptide3.energy_ground
print test_peptide3.sequence_bis

c-KIT
1.75989867909
['V', 'H', 'E', 'D', 'A']


In [45]:
results5 = run_mc(1000, test_peptide3)


 Cycle number : 1

Accepted [15, 1, 6, 14, 16] 1.73799844023 ['K', 'A', 'P', 'C', 'R'] 
Accepted [14, 6, 12, 1, 0] 1.50007999599 ['C', 'P', 'Q', 'A', 'G'] 
Accepted [15, 6, 5, 18, 17] 1.32208076125 ['K', 'P', 'M', 'D', 'H'] 
Accepted [8, 19, 6, 0, 16] 1.28674393368 ['W', 'E', 'P', 'G', 'R'] 

 Cycle number : 2

Accepted [15, 1, 0, 14, 6] 1.41863648794 ['K', 'A', 'G', 'C', 'P'] 
Accepted [16, 19, 17, 14, 5] 1.34354419115 ['R', 'E', 'H', 'C', 'M'] 
Accepted [1, 0, 16, 18, 2] 1.22749771693 ['A', 'G', 'R', 'D', 'V'] 

 Cycle number : 3

Accepted [6, 14, 0, 18, 6] 1.53576193154 ['P', 'C', 'G', 'D', 'P'] 
Accepted [6, 6, 16, 14, 2] 1.39976006589 ['P', 'P', 'R', 'C', 'V'] 

 Cycle number : 4

Accepted [9, 19, 12, 0, 10] 1.49012946035 ['S', 'E', 'Q', 'G', 'T'] 
Accepted [9, 19, 12, 0, 14] 1.38530876771 ['S', 'E', 'Q', 'G', 'C'] 

 Cycle number : 5

Accepted [15, 18, 4, 18, 13] 1.22881739285 ['K', 'D', 'I', 'D', 'Y'] 
Accepted [15, 17, 2, 18, 18] 1.2055680256 ['K', 'H', 'V', 'D', 'D'] 

 Cycle

In [50]:
results6 = run_mc(100, test_peptide3, temp=1.0, nb_cycles=1)


 Cycle number : 1

Run number: 65

Uniform 0.00255458796685 Ratio 0.00455167505647 Prob_Trans 0.00455167505647 
Accepted [1, 6, 12, 16, 12] 7.15215864858 ['A', 'P', 'Q', 'R', 'Q'] 

Run number: 66

Uniform 0.304753834197 Ratio 1.38376337551 Prob_Trans 1 
Accepted [1, 6, 18, 16, 12] 6.82735177746 ['A', 'P', 'D', 'R', 'Q'] 

Run number: 67

Uniform 0.77256317204 Ratio 1.45846224148 Prob_Trans 1 
Accepted [1, 6, 18, 2, 12] 6.44996915606 ['A', 'P', 'D', 'V', 'Q'] 

Run number: 68

Uniform 0.0332078140071 Ratio 3.75654957273 Prob_Trans 1 
Accepted [1, 6, 18, 2, 7] 5.1264682868 ['A', 'P', 'D', 'V', 'F'] 

Run number: 69

Uniform 0.877412086627 Ratio 5.75158140165 Prob_Trans 1 
Accepted [1, 6, 6, 2, 7] 3.37699344343 ['A', 'P', 'P', 'V', 'F'] 

Run number: 70

Uniform 0.163382794108 Ratio 1.17442230268 Prob_Trans 1 
Accepted [1, 18, 6, 2, 7] 3.21621707403 ['A', 'D', 'P', 'V', 'F'] 

Run number: 71

Uniform 0.0750633594662 Ratio 0.844591437387 Prob_Trans 0.844591437387 
Accepted [1, 18, 3, 2, 

### Observation

Apparently not! There still seems to be some kind of lee-way present in the sequence of the peptide *c-KIT* that it does allow certain mutations to take place without disturbing the interaction matrix. 