Notebook for reading and manipulating the data provided in the paper *PDZ Domain Binding Selectivity Is Optimized Across the Mouse Proteome* by Stiffler et al. The multidomain selectivity model (MDSM) proposed in the paper is also implemented.

In [1]:
import os
# Raw string so the backslashes in the Windows path are not read as escape sequences.
os.chdir(r'E:\Ecole\Year 3\Projet 3A')

In [2]:
import pandas as pd 
import numpy as np 

The data is provided as a multi-dimensional array stored in an Excel file. We write two classes. The first, Domain, holds the data specific to a single domain: the values of $\theta_{i,p,q}$ for that domain $i$ and its threshold values.

The second class, Data, holds all the data provided in the Excel file. It contains a list of Domain objects as well as a list of amino acids.

These two classes make the data easy to read and manipulate. To test the model proposed in the paper, we need to extract the $5 \times 20$ matrix $\theta_{p,q}$ of each domain. This is where the class Domain is useful.

In [3]:
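class Domain:
    """Data for a single domain: its name, theta matrix and thresholds."""
    def __init__(self, name):
        self.name = name
        self.thresholds = None  # filled in by Data.create_domains
        self.thetas = None      # filled in by Data.create_domains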

In [4]:
class Data:
    def __init__(self, filename):
        self.filename = filename
        temp_df = pd.read_excel(self.filename)
        # The first 20 column labels are taken as the amino acids.
        self.aminoacids = list(temp_df.columns[:20])
        # Transpose so that each column holds the values of one domain.
        self.df = temp_df.T
        # Keep the labels as returned by pandas so they can be used for column lookups.
        self.domains = [Domain(domain) for domain in self.df.columns]

    def create_domains(self):
        for domain in self.domains:
            column = self.df[domain.name]
            # First 100 values: thetas, reshaped to 5 rows of 20 values each.
            domain.thetas = np.asarray(column.iloc[:100]).reshape(5, 20)
            # Remaining values: thresholds.
            domain.thresholds = np.asarray(column.iloc[100:])
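
The reshape in create_domains relies on NumPy's default row-major order. The minimal sketch below, which uses a stand-in array (`flat_thetas`, a hypothetical name) rather than the real data, shows how a flat index maps to a row and column of the $5 \times 20$ matrix. It assumes, as the reshape itself does, that each block of 20 consecutive values belongs to one row.

```python
import numpy as np

# Hypothetical stand-in for one domain's 100 flattened theta values.
flat_thetas = np.arange(100)

# Same reshape as in create_domains: row-major, so each row
# holds 20 consecutive values of the flat array.
thetas = flat_thetas.reshape(5, 20)

# Flat index k maps to row k // 20 and column k % 20.
k = 47
row, col = divmod(k, 20)
print(thetas.shape, row, col)  # (5, 20) 2 7
assert thetas[row, col] == flat_thetas[k]
```

If the spreadsheet stored the values in a different order, the reshape (or a transpose after it) would have to be adapted accordingly.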

In [5]:
PDZ_Data = Data('Data_PDZ/MDSM_01_stiffler_bis.xls')
PDZ_Data.create_domains()
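
Since the Excel file is not included here, the extraction can be checked on a small synthetic table. The sketch below is only an illustration: the domain names, the column labels such as `K_3`, and the number of threshold columns are all hypothetical and are not taken from the real spreadsheet. It mirrors the transpose and slicing steps of Data and then looks up a single theta value.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the spreadsheet: one row per domain,
# 100 theta columns followed by 2 threshold columns (the real count may differ).
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
theta_cols = ["%s_%d" % (aa, pos) for pos in range(5) for aa in amino_acids]
columns = theta_cols + ["threshold_a", "threshold_b"]
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(2, 102), index=["domain_1", "domain_2"], columns=columns)

# Mirror Data.__init__ and Data.create_domains on the toy table.
df = toy.T
column = df["domain_1"]
thetas = np.asarray(column.iloc[:100]).reshape(5, 20)
thresholds = np.asarray(column.iloc[100:])

# Look up the theta for amino acid K in row 3 of the matrix.
row = 3
col = amino_acids.index("K")
print(thetas.shape, thresholds.shape)
print(thetas[row, col] == toy.loc["domain_1", "K_3"])
```

On the real data, the same lookup would read from the thetas attribute of one of the objects in PDZ_Data.domains.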