This notebook reads and manipulates the data provided with the paper *PDZ Domain Binding Selectivity Is Optimized Across the Mouse Proteome* by Stiffler et al. It also implements the MDSM model.

In [1]:
import os
# Raw string so that the backslashes in the Windows path are never read as escape sequences.
os.chdir(r'E:\Ecole\Year 3\Projet 3A')

In [2]:
import pandas as pd 
import numpy as np 

The data is provided as a multi-dimensional array. We write two classes. The first, Domain, holds the data specific to a single domain: the values of the $\theta_{i,p,q}$ and the threshold values.

The second, Data, holds all the data provided in the Excel file. It contains a list of Domain objects as well as the list of amino acids.

Together, these two classes make the data easy to read and manipulate. To test the model proposed in the paper, we need to extract the $\theta_{p,q}$ (position $p$, amino acid $q$) for each domain, which is what the Domain class is for.

In [3]:
class Domain:
    def __init__(self, name):
        self.name = name
        self.thresholds = None
        self.thetas = None

In [4]:
class Data:
    def __init__(self, filename):
        self.filename = filename
        temp_df = pd.read_excel(self.filename)
        # The first 20 column labels are the one-letter amino acid codes.
        self.aminoacids = [acid.encode('utf-8') for acid in list(temp_df.columns[:20])]
        # Transpose so that each column holds the data of one domain.
        self.df = temp_df.T
        self.domains = [Domain(domain.encode('utf-8')) for domain in list(self.df.columns)]
        self.names = [domain.name for domain in self.domains]

    def create_domains(self):
        for domain in self.domains:
            # First 100 values: the thetas, reshaped to 5 positions x 20 amino acids.
            domain.thetas = np.asarray(self.df[domain.name][:100]).reshape(5, 20)
            # Remaining values: the thresholds.
            domain.thresholds = np.asarray(self.df[domain.name][100:])
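
To make the slicing in create_domains concrete, here is a minimal sketch on synthetic data. The array `column` is a hypothetical stand-in for one domain's column, not values from the Excel file, and the count of three trailing thresholds is assumed from the output shown for LIN-7A. It illustrates that the reshape is row-major, so each consecutive block of 20 values becomes one row of the matrix.

```python
import numpy as np

# Hypothetical stand-in for one domain's column:
# 100 theta values followed by 3 threshold values (assumed layout).
column = np.arange(103, dtype=float)

n_positions, n_aminoacids = 5, 20
n_thetas = n_positions * n_aminoacids

# Row-major reshape: row p holds the flat entries 20*p .. 20*p + 19.
thetas = column[:n_thetas].reshape(n_positions, n_aminoacids)
thresholds = column[n_thetas:]

print(thetas.shape)
print(thresholds.shape)
```

If the columns of the file were instead grouped by amino acid rather than by position, the reshape would need to be followed by a transpose; the sketch assumes the grouping by position that the code above relies on.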

In [5]:
PDZ_Data = Data('Data_PDZ/MDSM_01_stiffler_bis.xls')
PDZ_Data.create_domains()

Let us now browse the data using these two classes, starting with the amino acids.

In [6]:
print PDZ_Data.aminoacids

['G', 'A', 'V', 'L', 'I', 'M', 'P', 'F', 'W', 'S', 'T', 'N', 'Q', 'Y', 'C', 'K', 'R', 'H', 'D', 'E']


Let us now see which PDZ domains are present in the data.

In [7]:
print PDZ_Data.names

['Cipp (03/10)', 'Cipp (05/10)', 'Cipp (08/10)', 'Cipp (09/10)', 'Cipp (10/10)', 'D930005D10Rik (1/1)', 'Dlgh3 (1/1)', 'Dvl1 (1/1)', 'Dvl2 (1/1)', 'Dvl3 (1/1)', 'Erbin (1/1)', 'Gm1582 (2/3)', 'GRASP55 (1/1)', 'Grip1 (6/7)', 'Grip2 (5/7)', 'Harmonin (2/3)', 'HtrA1 (1/1)', 'HtrA3 (1/1)', 'Interleukin 16 (1/4)', 'LARG (1/1)', 'LIN-7A (1/1)', 'Lin7c (1/1)', 'Lnx1 (2/4)', 'Lrrc7 (1/1)', 'Magi-1 (2/6)', 'Magi-1 (4/6)', 'Magi-1 (6/6)', 'Magi-2 (5/6)', 'Magi-2 (6/6)', 'Magi-3 (2/5)', 'Magi-3 (5/5)', 'Mpp7 (1/1)', 'MUPP1 (01/13)', 'MUPP1 (05/13)', 'MUPP1 (10/13)', 'MUPP1 (11/13)', 'MUPP1 (12/13)', 'MUPP1 (13/13)', 'NHERF-1 (1/2)', 'NHERF-2 (2/2)', 'nNOS (1/1)', 'PAR-3 (3/3)', 'PAR3B (1/3)', 'PAR6B (1/1)', 'Pdlim5 (1/1)', 'Pdzk1 (1/4)', 'Pdzk1 (3/4)', 'Pdzk3 (1/1)', 'Pdzk3 (2/2)', 'Pdzk11 (1/1)', 'PDZ-RGS3 (1/1)', 'PSD95 (1/3)', 'PTP-BL (2/5)', 'SAP97 (1/3)', 'SAP97 (3/3)', 'SAP102 (3/3)', 'Scrb1 (1/4)', 'Scrb1 (2/4)', 'Scrb1 (3/4)', 'Semcap3 (1/2)', 'Shank1 (1/1)', 'Shank3 (1/1)', 'Shroom (1/1)

Let us now explore the data for a given PDZ domain, for example the one at index 20. Domains are accessed by their index in the list, and indexing starts at 0, so index 20 is the 21st entry: *LIN-7A (1/1)*.

In [8]:
PDZ_Data.domains[20]
print PDZ_Data.domains[20].name

LIN-7A (1/1)


In [9]:
test_domain = PDZ_Data.domains[20]
print test_domain.name

LIN-7A (1/1)
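
Index-based access requires knowing where a domain sits in the list. As a possible convenience, a dictionary keyed by name allows direct lookup. The sketch below uses a short toy list of names and a minimal copy of the Domain class; `domain_by_name` is a hypothetical helper that is not part of the notebook's classes.

```python
class Domain:
    """Minimal copy of the notebook's Domain class, for a self-contained example."""
    def __init__(self, name):
        self.name = name
        self.thresholds = None
        self.thetas = None

# Toy subset of the domain names; the real list comes from PDZ_Data.domains.
domains = [Domain(n) for n in ['LARG (1/1)', 'LIN-7A (1/1)', 'Lin7c (1/1)']]

# Hypothetical helper: map each name to its Domain object.
domain_by_name = {d.name: d for d in domains}

lin7a = domain_by_name['LIN-7A (1/1)']
print(lin7a.name)
```

This assumes that domain names are unique; with duplicate names, later entries would silently overwrite earlier ones in the dictionary.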


Each domain has two attributes: thetas and thresholds. The thetas are the $\theta_{i,p,q}$ of the paper, whereas the thresholds are the values used to determine whether the PDZ domain binds to a given peptide.

In [10]:
print test_domain.thresholds

[ 7.1429  7.1429  7.3827]


The thetas form a $5 \times 20$ matrix: 5 positions at the C-terminus of the peptide considered, by 20 amino acids. The data for each position is accessed by its row index. Position -4 in the peptide corresponds to index 0, position -3 to index 1, and so on up to position 0 at index 4.

In [11]:
print test_domain.thetas[0]

[ 0.       -0.019589  0.062421  0.68493   0.49776  -0.47603   0.2779
 -1.1723   -0.44707  -0.06063  -0.23115  -0.1164   -0.1734   -0.73216
 -0.45431   0.71784   0.95649   0.37813  -0.3057   -0.28117 ]
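
The mapping from peptide position and residue to matrix indices can be sketched as a small lookup function. `theta_at` is a hypothetical helper, and the sketch assumes that the columns of the matrix follow the amino acid order printed earlier by PDZ_Data.aminoacids. The toy matrix is synthetic.

```python
import numpy as np

# Amino acid order as printed by PDZ_Data.aminoacids.
AMINOACIDS = ['G', 'A', 'V', 'L', 'I', 'M', 'P', 'F', 'W', 'S',
              'T', 'N', 'Q', 'Y', 'C', 'K', 'R', 'H', 'D', 'E']

def theta_at(thetas, position, residue, aminoacids=AMINOACIDS):
    """Return the theta for a peptide position (-4..0) and a one-letter residue."""
    if not -4 <= position <= 0:
        raise ValueError('position must be between -4 and 0')
    row = position + 4               # position -4 -> row 0, position 0 -> row 4
    col = aminoacids.index(residue)  # column order assumed to match AMINOACIDS
    return thetas[row, col]

# Synthetic 5 x 20 matrix, only to exercise the index arithmetic.
toy = np.arange(100, dtype=float).reshape(5, 20)
print(theta_at(toy, -4, 'G'))
print(theta_at(toy, 0, 'E'))
```

Under the same column-order assumption, calling `theta_at(test_domain.thetas, -4, 'L')` on the real data would pick the fourth entry of the row printed above, 0.68493.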


In [12]:
test_domain.thetas.shape

(5L, 20L)
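
With the thetas and thresholds in hand, a binding prediction can be sketched as follows. This is only an illustration under stated assumptions: it assumes an additive score, in which the score of a peptide is the sum of the theta entries selected by its last five residues, and it assumes that binding is predicted when the score exceeds a threshold. How the three threshold values of a domain are meant to be used is not stated here, so the threshold is passed explicitly. `peptide_score` and `predicts_binding` are hypothetical names, and the matrix and peptide are toy data.

```python
import numpy as np

# Amino acid order as printed by PDZ_Data.aminoacids.
AMINOACIDS = ['G', 'A', 'V', 'L', 'I', 'M', 'P', 'F', 'W', 'S',
              'T', 'N', 'Q', 'Y', 'C', 'K', 'R', 'H', 'D', 'E']

def peptide_score(thetas, peptide, aminoacids=AMINOACIDS):
    """Sum the thetas selected by the last five residues (assumed additive model)."""
    tail = peptide[-5:]
    if len(tail) != 5:
        raise ValueError('peptide must have at least 5 residues')
    cols = [aminoacids.index(residue) for residue in tail]
    # Row k corresponds to position k - 4, i.e. rows 0..4 map to positions -4..0.
    return float(thetas[np.arange(5), cols].sum())

def predicts_binding(thetas, peptide, threshold):
    """Assumed decision rule: predict binding when the score exceeds the threshold."""
    return peptide_score(thetas, peptide) > threshold

# Toy matrix: only two non-zero entries, so the score is easy to check by hand.
toy = np.zeros((5, 20))
toy[4, AMINOACIDS.index('V')] = 2.0   # position 0, valine
toy[2, AMINOACIDS.index('T')] = 1.5   # position -2, threonine

score = peptide_score(toy, 'KQTSV')
print(score)
```

For the toy peptide, only the threonine at position -2 and the valine at position 0 contribute, so the score is 1.5 + 2.0 = 3.5.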