## Assemble gene-level data for regression analysis

Script to combine
* protein translation rates
* transcript abundances
* initiation probabilities
* ORF lengths
* Codon Adaptation Indices

into one .csv file.
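The assembly pattern used throughout this notebook (one dict per quantity, keyed by gene, outer-merged on the gene name) can be sketched as a single helper. This is a minimal illustration, not the notebook's own code: the function name `dicts_to_frame` and the toy gene names and values are hypothetical.

```python
from functools import reduce

import pandas as pd


def dicts_to_frame(named_dicts, key="gene"):
    """Outer-merge several {gene: value} dicts into one DataFrame.

    named_dicts is a list of (column_name, dict) pairs (hypothetical helper).
    """
    frames = [
        pd.DataFrame(list(d.items()), columns=[key, name])
        for name, d in named_dicts
    ]
    # An outer merge keeps every gene that occurs in at least one dict;
    # quantities missing for a gene become NaN.
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="outer"),
        frames,
    )


# Toy data, made up for illustration (not taken from the pickle files).
toy = dicts_to_frame([
    ("CAI", {"geneA": 0.71, "geneB": 0.73}),
    ("ORF length [nts]", {"geneA": 597, "geneC": 336}),
])
print(toy)
```

The sections below do the same thing step by step, which makes it easier to inspect each input before it is merged.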

In [1]:
import csv
import cPickle as pkl
import pandas as pd

### 1. Protein translation rates

The following protein translation rates are normalized by transcript abundance:

In [2]:
with open("../parameters/prot_per_transcript_speeds.p", "rb") as f:
    prot_speeds = pkl.load(f)

In [3]:
pd.Series(prot_speeds.values()).describe()

count    4475.000000
mean        0.052763
std         0.033543
min         0.000565
25%         0.029944
50%         0.045198
75%         0.066667
max         0.267514
dtype: float64

Units are finished protein molecules per second per transcript.

In [14]:
df1 = pd.DataFrame(list(prot_speeds.items()),
                   columns=['gene', 'modelled translation rate [s^-1]'])

In [15]:
df1.head()

| | gene | modelled translation rate [s^-1] |
|---|---|---|
| 0 | YBR177C | 0.354237 |
| 1 | YIL140W | 0.110455 |
| 2 | YLR268W | 0.499435 |
| 3 | YOR011W | 0.010735 |
| 4 | YPL043W | 0.482203 |


Measured values from quantitative microarray analysis and subsequent modelling (Arava et al., http://www.pnas.org/content/100/7/3889.full). We use the rates _per transcript_:

In [16]:
with open("../parameters/prot_per_transcript_arava.p", "rb") as f:
    prot_rates_per_transcript = pkl.load(f)

In [17]:
pd.Series(prot_rates_per_transcript.values()).describe()

count    5643.000000
mean        0.116282
std         0.108930
min         0.001887
25%         0.041829
50%         0.095339
75%         0.159180
max         2.189396
dtype: float64

In [18]:
df = pd.DataFrame(
    list(prot_rates_per_transcript.items()),
    columns=['gene', 'experimental translation rate per transcript [s^-1]']
)

In [19]:
df = pd.merge(df, df1, on='gene', how='outer')

In [20]:
df.head()

| | gene | experimental translation rate per transcript [s^-1] | modelled translation rate [s^-1] |
|---|---|---|---|
| 0 | YAL008W | 0.201307 | 0.293785 |
| 1 | YBR255W | 0.02907 | 0.022317 |
| 2 | YGR164W | 0.285791 | NaN |
| 3 | YGR131W | 0.12459 | NaN |
| 4 | YNL003C | 0.072844 | 0.084181 |
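Because the merge is an outer join, genes present in only one of the two sources are kept, with missing values for the other column. A quick way to see how many genes fall into each case is the `indicator` option of `pd.merge`. The sketch below uses made-up gene names and values, not the notebook's data.

```python
import pandas as pd

# Toy frames with hypothetical gene names and values.
experimental = pd.DataFrame({
    "gene": ["g1", "g2", "g3"],
    "experimental": [0.20, 0.03, 0.29],
})
modelled = pd.DataFrame({
    "gene": ["g1", "g2", "g4"],
    "modelled": [0.29, 0.02, 0.08],
})

# indicator=True adds a "_merge" column recording where each row came from:
# "both", "left_only" or "right_only".
merged = pd.merge(experimental, modelled, on="gene", how="outer",
                  indicator=True)
coverage = merged["_merge"].value_counts()
print(coverage)
```

Rows marked `left_only` or `right_only` are the ones that carry missing values after the merge.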


### 2. Transcript abundances

TODO: also test other transcriptomes.

In [21]:
with open('../parameters/transcriptome_shah.p', 'rb') as f:
    transcriptome = pkl.load(f)

In [22]:
pd.Series(transcriptome.values()).describe()

count    4839.000000
mean       12.400496
std        50.939523
min         0.000000
25%         2.000000
50%         3.000000
75%         7.000000
max      1381.000000
dtype: float64

In [23]:
df2 = pd.DataFrame(list(transcriptome.items()),
                   columns=['gene', 'transcript abundance'])

In [24]:
df = pd.merge(df, df2, on='gene', how='outer')

In [25]:
df.head()

| | gene | experimental translation rate per transcript [s^-1] | modelled translation rate [s^-1] | transcript abundance |
|---|---|---|---|---|
| 0 | YAL008W | 0.201307 | 0.293785 | 3.0 |
| 1 | YBR255W | 0.02907 | 0.022317 | 1.0 |
| 2 | YGR164W | 0.285791 | NaN | NaN |
| 3 | YGR131W | 0.12459 | NaN | NaN |
| 4 | YNL003C | 0.072844 | 0.084181 | 3.0 |


### 3. Initiation probabilities

Version from private email, Sept 30, 2015.

In [26]:
with open('../parameters/init_rates_plotkin.p', 'rb') as f:
    init_rates = pkl.load(f)

In [27]:
pd.Series(init_rates.values()).describe()

count    4.839000e+03
mean     1.567727e-06
std      1.128263e-06
min      9.375766e-10
25%      8.320521e-07
50%      1.291872e-06
75%      1.962904e-06
max      1.440641e-05
dtype: float64

Units are successful ribosome initiations per second.

In [28]:
df3 = pd.DataFrame(list(init_rates.items()),
                   columns=['gene', 'initiation rate [s^-1]'])

In [29]:
df = pd.merge(df, df3, on='gene', how='outer')

In [30]:
df.head()

| | gene | experimental translation rate per transcript [s^-1] | modelled translation rate [s^-1] | transcript abundance | initiation rate [s^-1] |
|---|---|---|---|---|---|
| 0 | YAL008W | 0.201307 | 0.293785 | 3.0 | 3.024409e-06 |
| 1 | YBR255W | 0.02907 | 0.022317 | 1.0 | 6.376904e-07 |
| 2 | YGR164W | 0.285791 | NaN | NaN | NaN |
| 3 | YGR131W | 0.12459 | NaN | NaN | NaN |
| 4 | YNL003C | 0.072844 | 0.084181 | 3.0 | 7.715578e-07 |


### 4. ORF lengths

In [31]:
with open("../parameters/orf_coding.p", "rb") as f:
    orf_genomic_dict = pkl.load(f)

In [32]:
orf_lengths = {gene: len(seq) for gene, seq in orf_genomic_dict.items()}

In [33]:
pd.Series(orf_lengths.values()).describe()

count     6713.000000
mean      1352.414122
std       1139.682772
min         51.000000
25%        534.000000
50%       1077.000000
75%       1767.000000
max      14733.000000
dtype: float64
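Lengths here are in nucleotides. If a length in codons is wanted instead, a check that every sequence length is a multiple of three is a cheap sanity test before dividing. The following is a minimal sketch under assumptions: the helper name `codon_counts` and the toy sequences are hypothetical, and whether the stored sequences include the stop codon is not stated in this notebook.

```python
def codon_counts(orf_sequences):
    """Return {gene: number of codons} for a {gene: sequence} dict.

    Raises ValueError for any sequence whose length is not a multiple of 3.
    """
    counts = {}
    for gene, seq in orf_sequences.items():
        if len(seq) % 3 != 0:
            raise ValueError(
                "%s: length %d is not a multiple of 3" % (gene, len(seq))
            )
        counts[gene] = len(seq) // 3
    return counts


# Toy sequences, made up for illustration (not real yeast ORFs).
toy_orfs = {"geneA": "ATGGCTTAA", "geneB": "ATGAAACCCGGGTGA"}
toy_counts = codon_counts(toy_orfs)
print(toy_counts)
```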

In [34]:
df4 = pd.DataFrame(list(orf_lengths.items()),
                   columns=['gene', 'ORF length [nts]'])

In [35]:
df = pd.merge(df, df4, on='gene', how='outer')

In [36]:
df.head()

| | gene | experimental translation rate per transcript [s^-1] | modelled translation rate [s^-1] | transcript abundance | initiation rate [s^-1] | ORF length [nts] |
|---|---|---|---|---|---|---|
| 0 | YAL008W | 0.201307 | 0.293785 | 3.0 | 3.024409e-06 | 597 |
| 1 | YBR255W | 0.02907 | 0.022317 | 1.0 | 6.376904e-07 | 2085 |
| 2 | YGR164W | 0.285791 | NaN | NaN | NaN | 336 |
| 3 | YGR131W | 0.12459 | NaN | NaN | NaN | 525 |
| 4 | YNL003C | 0.072844 | 0.084181 | 3.0 | 7.715578e-07 | 855 |


### 5. Codon Adaptation Indices

In [37]:
with open("../parameters/cai_dict.p", "rb") as f:
    cai_dict = pkl.load(f)

In [38]:
pd.Series(cai_dict.values()).describe()

count    5917.000000
mean        0.733184
std         0.044777
min         0.475284
25%         0.712289
50%         0.735966
75%         0.756314
max         0.922365
dtype: float64
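For orientation, the Codon Adaptation Index is commonly defined (Sharp and Li, 1987) as the geometric mean of the relative adaptiveness weights of the codons in a gene. How the values in `cai_dict` were actually computed is not stated in this notebook, so the sketch below only illustrates the general formula. The function name `cai` and the weight table are hypothetical, and codons without a weight are simply skipped here.

```python
import math


def cai(sequence, weights):
    """Geometric mean of per-codon relative adaptiveness weights.

    Codons that have no entry in `weights` are ignored (an assumption of
    this sketch, e.g. for codons without synonymous alternatives).
    """
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    # exp(mean(log w)) is the geometric mean of the weights.
    return math.exp(sum(logs) / len(logs))


# Hypothetical weights for a handful of codons (not a real yeast table).
toy_weights = {"GCT": 1.0, "GCA": 0.5, "AAA": 0.25, "AAG": 1.0}
toy_cai = cai("GCTGCAAAAAAG", toy_weights)
print(toy_cai)
```

The toy sequence consists of the codons GCT, GCA, AAA and AAG, so the result is the fourth root of 1.0 × 0.5 × 0.25 × 1.0.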

In [39]:
df5 = pd.DataFrame(list(cai_dict.items()), columns=['gene', 'CAI'])

In [40]:
df = pd.merge(df, df5, on='gene', how='outer')

In [41]:
df.head()

| | gene | experimental translation rate per transcript [s^-1] | modelled translation rate [s^-1] | transcript abundance | initiation rate [s^-1] | ORF length [nts] | CAI |
|---|---|---|---|---|---|---|---|
| 0 | YAL008W | 0.201307 | 0.293785 | 3.0 | 3.024409e-06 | 597 | 0.709594 |
| 1 | YBR255W | 0.02907 | 0.022317 | 1.0 | 6.376904e-07 | 2085 | 0.723708 |
| 2 | YGR164W | 0.285791 | NaN | NaN | NaN | 336 | NaN |
| 3 | YGR131W | 0.12459 | NaN | NaN | NaN | 525 | 0.729506 |
| 4 | YNL003C | 0.072844 | 0.084181 | 3.0 | 7.715578e-07 | 855 | 0.714273 |


### 6. Save as CSV

In [42]:
df.to_csv('../parameters/regression_data.csv')
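When the file is read back for the regression, two details matter: `to_csv` writes the row index as an unnamed first column by default, and the outer merges leave missing values for genes absent from some sources. The sketch below shows one way to handle both. It uses a made-up toy table and an in-memory buffer instead of the real file, and the choice of columns to filter on is only an example.

```python
import io

import pandas as pd

# Toy table with made-up values, standing in for the assembled DataFrame.
toy = pd.DataFrame({
    "gene": ["g1", "g2", "g3"],
    "CAI": [0.71, None, 0.73],
    "ORF length [nts]": [597, 336, 525],
})

# With no path argument, to_csv returns the CSV text; the row index is
# written as an unnamed first column.
csv_text = toy.to_csv()

# index_col=0 restores that column as the index instead of "Unnamed: 0".
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Keep only genes that have a value in every column used as a regressor.
complete = restored.dropna(subset=["CAI", "ORF length [nts]"])
print(complete)
```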