## Assemble gene-level data for regression analysis

Script to combine
* protein translation rates
* transcript abundances
* initiation probabilities
* ORF lengths
* Codon Adaptation Indices

into one .csv file.

In [1]:
import csv
import cPickle as pkl
import pandas as pd

### 1. Protein translation rates

The following protein translation rates are normalized by transcript abundance:

In [2]:
# Open the pickle in binary mode and close the handle when done
with open("../parameters/prot_speeds.p", "rb") as f:
    prot_speeds = pkl.load(f)

In [3]:
pd.Series(prot_speeds.values()).describe()

count    4475.000000
mean        1.546003
std         9.023957
min         0.001695
25%         0.055651
50%         0.140960
75%         0.431638
max       223.223164
dtype: float64

Units are finished molecules per second per transcript.

In [4]:
df1 = pd.DataFrame.from_dict(prot_speeds.items())
df1.columns = ['gene', 'modelled translation rate [s^-1]']

In [5]:
df1.head()

Unnamed: 0,gene,modelled translation rate [s^-1]
0,YBR177C,0.354237
1,YIL140W,0.110455
2,YLR268W,0.499435
3,YOR011W,0.010735
4,YPL043W,0.482203
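The dict-to-DataFrame step above recurs for every data source in this notebook. As a minimal sketch, it could be wrapped in a small helper; `dict_to_frame` and `toy_speeds` are hypothetical names that are not part of the original notebook, and the toy values are copied from the table above. Wrapping `items()` in `list(...)` is an assumption made so the same call works under both Python 2 and Python 3.

```python
import pandas as pd

def dict_to_frame(values_by_gene, value_name):
    """Turn a {gene: value} dict into a two-column DataFrame."""
    # list(...) works whether items() returns a list (Python 2)
    # or a view object (Python 3).
    return pd.DataFrame(list(values_by_gene.items()),
                        columns=['gene', value_name])

# Toy stand-in for prot_speeds (hypothetical subset of the real dict)
toy_speeds = {'YBR177C': 0.354237, 'YIL140W': 0.110455}
toy_df = dict_to_frame(toy_speeds, 'modelled translation rate [s^-1]')
print(toy_df)
```

Passing the column names at construction time avoids the separate `df.columns = [...]` assignment.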


Measured values from quantitative microarray analysis and subsequent modelling (Arava et al., http://www.pnas.org/content/100/7/3889.full).

### 2. Transcript abundances

TODO: also test other transcriptomes.

In [6]:
# Open the pickle in binary mode and close the handle when done
with open('../parameters/transcriptome_shah.p', 'rb') as f:
    transcriptome = pkl.load(f)

In [7]:
pd.Series(transcriptome.values()).describe()

count    4839.000000
mean       12.400496
std        50.939523
min         0.000000
25%         2.000000
50%         3.000000
75%         7.000000
max      1381.000000
dtype: float64

In [8]:
df2 = pd.DataFrame.from_dict(transcriptome.items())
df2.columns = ['gene', 'transcript abundance']

In [9]:
df = pd.merge(df1, df2, on='gene', how='outer')

In [10]:
df.head()

Unnamed: 0,gene,translation rate [s^-1],transcript abundance
0,YBR177C,0.354237,6
1,YIL140W,0.110455,2
2,YLR268W,0.499435,8
3,YOR011W,0.010735,1
4,YPL043W,0.482203,7
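Because the merge uses `how='outer'`, genes that appear in only one of the two tables are kept, and their missing values become NaN. The `describe()` outputs above count 4475 translation rates but 4839 transcript abundances, so some rows will be incomplete. The sketch below shows one way to check coverage with the `indicator` option of `pd.merge`. The frames `left` and `right` are toy data, and `GENE_X` with its abundance of 3 is a made-up placeholder, not a real entry.

```python
import pandas as pd

# Toy frames; YBR177C and YIL140W values are taken from the tables above,
# GENE_X is a hypothetical gene that has no translation rate.
left = pd.DataFrame({'gene': ['YBR177C', 'YIL140W'],
                     'modelled translation rate [s^-1]': [0.354237, 0.110455]})
right = pd.DataFrame({'gene': ['YBR177C', 'GENE_X'],
                      'transcript abundance': [6, 3]})

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only' or 'right_only' for each row.
merged = pd.merge(left, right, on='gene', how='outer', indicator=True)
coverage = merged['_merge'].value_counts()

print(merged)
print(coverage)
```

Rows marked `left_only` or `right_only` are the ones that carry NaN in the columns of the other table.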


### 3. Initiation probabilities

Version from private email, Sept 30, 2015.

In [11]:
# Open the pickle in binary mode and close the handle when done
with open('../parameters/init_rates_plotkin.p', 'rb') as f:
    init_rates = pkl.load(f)

In [12]:
pd.Series(init_rates.values()).describe()

count    4.839000e+03
mean     1.567727e-06
std      1.128263e-06
min      9.375766e-10
25%      8.320521e-07
50%      1.291872e-06
75%      1.962904e-06
max      1.440641e-05
dtype: float64

Units are successful ribosome initiations per second.

In [13]:
df3 = pd.DataFrame.from_dict(init_rates.items())
df3.columns = ['gene', 'initiation rate [s^-1]']

In [14]:
df = pd.merge(df, df3, on='gene', how='outer')

In [15]:
df.head()

Unnamed: 0,gene,translation rate [s^-1],transcript abundance,initiation rate [s^-1]
0,YBR177C,0.354237,6,1.644214e-06
1,YIL140W,0.110455,2,1.649229e-06
2,YLR268W,0.499435,8,1.844285e-06
3,YOR011W,0.010735,1,2.68682e-07
4,YPL043W,0.482203,7,2.096261e-06


### 4. ORF lengths

In [16]:
# Open the pickle in binary mode and close the handle when done
with open("../parameters/orf_coding.p", "rb") as f:
    orf_genomic_dict = pkl.load(f)

In [17]:
orf_lengths = {prot: len(orf_genomic_dict[prot]) for prot in orf_genomic_dict}

In [18]:
pd.Series(orf_lengths.values()).describe()

count     6713.000000
mean      1352.414122
std       1139.682772
min         51.000000
25%        534.000000
50%       1077.000000
75%       1767.000000
max      14733.000000
dtype: float64

In [19]:
df4 = pd.DataFrame.from_dict(orf_lengths.items())
df4.columns = ['gene', 'ORF length [nts]']

In [20]:
df = pd.merge(df, df4, on='gene', how='outer')

In [21]:
df.head()

Unnamed: 0,gene,translation rate [s^-1],transcript abundance,initiation rate [s^-1],ORF length [nts]
0,YBR177C,0.354237,6,1.644214e-06,1356
1,YIL140W,0.110455,2,1.649229e-06,2472
2,YLR268W,0.499435,8,1.844285e-06,645
3,YOR011W,0.010735,1,2.68682e-07,4185
4,YPL043W,0.482203,7,2.096261e-06,2058


### 5. Codon Adaptation Indices

In [22]:
# Open the pickle in binary mode and close the handle when done
with open("../parameters/cai_dict.p", "rb") as f:
    cai_dict = pkl.load(f)

In [23]:
pd.Series(cai_dict.values()).describe()

count    5917.000000
mean        0.733184
std         0.044777
min         0.475284
25%         0.712289
50%         0.735966
75%         0.756314
max         0.922365
dtype: float64

In [24]:
df5 = pd.DataFrame.from_dict(cai_dict.items())
df5.columns = ['gene', 'CAI']

In [25]:
df = pd.merge(df, df5, on='gene', how='outer')

In [26]:
df.head()

Unnamed: 0,gene,translation rate [s^-1],transcript abundance,initiation rate [s^-1],ORF length [nts],CAI
0,YBR177C,0.354237,6,1.644214e-06,1356,0.743503
1,YIL140W,0.110455,2,1.649229e-06,2472,0.742082
2,YLR268W,0.499435,8,1.844285e-06,645,0.710616
3,YOR011W,0.010735,1,2.68682e-07,4185,0.767968
4,YPL043W,0.482203,7,2.096261e-06,2058,0.768255
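The merge cells in this notebook all follow the same pattern: an outer join on `gene` between the running table and the next one. As a sketch of an alternative, the chain could be folded into a single call with `functools.reduce`. `merge_on_gene` and `toy_frames` are hypothetical names, and the toy values are copied from the tables above.

```python
from functools import reduce

import pandas as pd

def merge_on_gene(frames):
    """Outer-join a list of DataFrames on their shared 'gene' column."""
    return reduce(
        lambda acc, nxt: pd.merge(acc, nxt, on='gene', how='outer'),
        frames,
    )

# Toy frames standing in for df2, df4 and df5
toy_frames = [
    pd.DataFrame({'gene': ['YBR177C', 'YIL140W'],
                  'transcript abundance': [6, 2]}),
    pd.DataFrame({'gene': ['YBR177C', 'YIL140W'],
                  'ORF length [nts]': [1356, 2472]}),
    pd.DataFrame({'gene': ['YBR177C', 'YIL140W'],
                  'CAI': [0.743503, 0.742082]}),
]
combined = merge_on_gene(toy_frames)
print(combined)
```

The step-by-step version used in the notebook has the advantage that each intermediate table can be inspected with `df.head()`.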


### 6. Save as CSV

In [27]:
df.to_csv('../parameters/regression_data.csv')
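By default `to_csv` writes the DataFrame index as an unnamed first column, so the file is best read back with `index_col=0`. The sketch below illustrates the round trip and one possible way to keep only complete rows, assuming the downstream regression needs a value in every column. It uses an in-memory buffer instead of the real file path and is written for Python 3; `toy`, `buffer`, `reloaded` and `complete` are hypothetical names, and `GENE_X` is a made-up placeholder gene.

```python
import io

import pandas as pd

# Toy table with one incomplete row (GENE_X has no CAI)
toy = pd.DataFrame({'gene': ['YBR177C', 'GENE_X'],
                    'CAI': [0.743503, float('nan')],
                    'ORF length [nts]': [1356, 300]})

# Write to an in-memory buffer in place of '../parameters/regression_data.csv'
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating 'Unnamed: 0'
reloaded = pd.read_csv(buffer, index_col=0)

# Keep only genes that have a value in every column
complete = reloaded.dropna()
print(complete)
```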