## Machine learning analysis of 12,991 potentially informative features derived from molecular modeling* for their use in predicting experimental $k_{cat}$, $K_M$, and $k_{cat}/K_M$

### *modern Rosetta protocol

We are interested in using machine learning tools to predict enzyme function from protein structure. To feed structural information into these tools, we first need to identify informative features in the molecular modeling feature set $F$.

In [2]:
from sklearn import preprocessing, linear_model, ensemble, feature_selection, model_selection, pipeline 

In [3]:
import pandas 
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row, column
from bokeh.plotting import figure
output_notebook()

In [4]:
! pwd

/Users/alex/Documents/bglb_family/machine_learning


## Features! 

In [5]:
! rsync -avz ca:/share/work/alex/bigger_constraint_exploration_run/data.h5 . 

/home/carlin/.bashrc: line 14: module: command not found
/home/carlin/.bashrc: line 15: module: command not found
/home/carlin/.bashrc: line 16: module: command not found
receiving file list ... done

sent 16 bytes  received 88 bytes  208.00 bytes/sec
total size is 23122290  speedup is 222329.71


In [6]:
feat = pandas.read_hdf('data.h5')
feat.head()

| | (0, label) | (0, fa_atr) | (0, fa_rep) | (0, fa_sol) | (0, fa_intra_atr_xover4) | (0, fa_intra_rep_xover4) | (0, fa_intra_sol_xover4) | (0, lk_ball) | (0, lk_ball_iso) | (0, lk_ball_bridge) | ... | (447, omega) | (447, fa_dun_dev) | (447, fa_dun_rot) | (447, fa_dun_semi) | (447, p_aa_pp) | (447, hxl_tors) | (447, ref) | (447, rama_prepro) | (447, total) | (447, sequence_position) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A192S | weights | 1.0 | 0.55 | 1.0 | 1.0 | 0.55 | 1.0 | 0.92 | -0.38 | -0.33 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.99935 | 446 |
| A227W | weights | 1.0 | 0.55 | 1.0 | 1.0 | 0.55 | 1.0 | 0.92 | -0.38 | -0.33 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.23068 | 446 |
| A236E | weights | 1.0 | 0.55 | 1.0 | 1.0 | 0.55 | 1.0 | 0.92 | -0.38 | -0.33 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.63702 | 446 |
| A249E | weights | 1.0 | 0.55 | 1.0 | 1.0 | 0.55 | 1.0 | 0.92 | -0.38 | -0.33 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.04588 | 446 |
| A306N | weights | 1.0 | 0.55 | 1.0 | 1.0 | 0.55 | 1.0 | 0.92 | -0.38 | -0.33 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.99261 | 446 |


In [7]:
feat.shape

(201, 12991)

Before we get into the targets, let's get rid of as many of these features as we can. Having 12,991 features for only 201 samples is not great!

First, let's get rid of string features, since they are likely labels. The easiest way to do this is to keep only the features whose dtype is float64.

In [8]:
feat = feat.select_dtypes(['float64'])
feat.shape

(201, 12095)
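To make the effect of the dtype filter concrete, here is a minimal sketch on a toy table. The column names are borrowed from the `feat.head()` output above, but the frame itself is a made-up stand-in, not the real data. Note that `select_dtypes(['float64'])` drops integer columns as well as string columns. The 896 columns removed above equal 2 × 448, which would be consistent with one `label` and one `sequence_position` column per position (0 through 447), though that is an inference from the shapes rather than something checked here.

```python
import pandas as pd

# Toy stand-in for the feature table (values are illustrative only).
toy = pd.DataFrame({
    "label": ["weights", "weights"],    # string column (object dtype)
    "fa_atr": [1.0, 1.0],               # float64
    "total": [7.99935, 6.23068],        # float64
    "sequence_position": [446, 446],    # int64
})

# Keep only float64 columns: both the string and the integer column go.
floats_only = toy.select_dtypes(["float64"])
print(list(floats_only.columns))  # ['fa_atr', 'total']
```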

Next, let's get rid of features with zero variance.

In [9]:
zero_variance_features = []
for col in feat.columns:
    if feat[col].std() == 0.0:
        zero_variance_features.append(col)
feat = feat.drop(zero_variance_features, axis=1)
feat.shape

(201, 8401)

Ooooh, that is good, we get rid of another 3,694 useless features this way! 
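As an aside, the same filter can be written without an explicit loop. The sketch below is one possible alternative, not the method used above: it keeps columns with more than one distinct value, which avoids comparing a floating-point standard deviation to exactly `0.0` (a comparison that can, in principle, be thrown off by rounding). The toy frame and its column names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real feature table (hypothetical values).
toy = pd.DataFrame(
    {
        "constant": [0.55, 0.55, 0.55, 0.55],
        "varies": [1.0, 2.0, 3.0, 4.0],
        "also_constant": [0.0, 0.0, 0.0, 0.0],
    },
    index=["A192S", "A227W", "A236E", "A249E"],
)

# nunique() counts distinct values per column; a column with a single
# distinct value has zero variance and carries no information.
keep = toy.columns[toy.nunique() > 1]
reduced = toy[keep]
print(reduced.shape)  # (4, 1)
```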

Now that we have a "reasonable" set of features, let's look at the targets.

## Machine learning targets! 

In [10]:
targets = pandas.read_csv('/Users/alex/Documents/bglb_data_set/bglb_targets.csv', index_col=0)
targets.head()

| mutant_name | kcat | km | kcatkm |
|---|---|---|---|
| S14A | 2.50515 | 0.916454 | 4.589089 |
| T15A | 2.788168 | 0.724276 | 5.063892 |
| S16A | 2.187521 | 1.146438 | 4.041274 |
| S17A | 2.928396 | 1.265996 | 4.66255 |
| S17E | 2.806858 | 0.864511 | 4.942484 |


Now let's join the features and targets on the mutant name and see what we get.

In [11]:
J = feat.join(targets)
J.head()



| | (0, fa_rep) | (0, fa_intra_rep_xover4) | (0, lk_ball) | (0, lk_ball_iso) | (0, lk_ball_bridge) | (0, lk_ball_bridge_uncpl) | (0, omega) | (0, fa_dun_dev) | (0, fa_dun_rot) | (0, fa_dun_semi) | ... | (447, lk_ball_iso) | (447, lk_ball_bridge) | (447, lk_ball_bridge_uncpl) | (447, fa_elec) | (447, fa_intra_elec) | (447, hbond_sc) | (447, total) | kcat | km | kcatkm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A192S | 0.55 | 0.55 | 0.92 | -0.38 | -0.33 | -0.33 | 0.48 | 0.69 | 0.76 | 0.78 | ... | -4.56208 | -0.05648 | -0.39996 | -3.922 | 0.83169 | -1.52653 | 7.99935 | 2.975891 | 0.706718 | 5.269158 |
| A227W | 0.55 | 0.55 | 0.92 | -0.38 | -0.33 | -0.33 | 0.48 | 0.69 | 0.76 | 0.78 | ... | -4.52115 | -0.06749 | -0.46456 | -3.80166 | 0.95059 | -1.45225 | 6.23068 | 2.581608 | 1.230449 | 4.351159 |
| A236E | 0.55 | 0.55 | 0.92 | -0.38 | -0.33 | -0.33 | 0.48 | 0.69 | 0.76 | 0.78 | ... | -4.5644 | -0.07601 | -0.40267 | -4.09516 | 0.90852 | -1.6566 | 4.63702 | NaN | NaN | NaN |
| A249E | 0.55 | 0.55 | 0.92 | -0.38 | -0.33 | -0.33 | 0.48 | 0.69 | 0.76 | 0.78 | ... | -4.56165 | -0.05857 | -0.41367 | -4.02934 | 0.82877 | -1.4172 | 7.04588 | NaN | NaN | NaN |
| A306N | 0.55 | 0.55 | 0.92 | -0.38 | -0.33 | -0.33 | 0.48 | 0.69 | 0.76 | 0.78 | ... | -4.58955 | -0.0806 | -0.41687 | -4.15582 | 0.97967 | -1.53085 | 4.99261 | NaN | NaN | NaN |
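The missing target values for some mutants follow from how `DataFrame.join` works: by default it is a left join on the index, so every row of the feature table is kept and mutants without a measurement get `NaN`. A minimal sketch with made-up numbers (the `feature_a` column is hypothetical):

```python
import pandas as pd

# Three modeled mutants, but only two with measured kinetics.
features = pd.DataFrame(
    {"feature_a": [0.1, 0.2, 0.3]},
    index=["A192S", "A227W", "A236E"],
)
targets = pd.DataFrame({"kcat": [2.98, 2.58]}, index=["A192S", "A227W"])

# Left join on the index: A236E is kept, with NaN for kcat.
joined = features.join(targets)
print(joined["kcat"].isna().sum())  # 1

# dropna() removes any row that still has a missing value.
complete = joined.dropna()
print(complete.shape)  # (2, 2)
```

This is why the number of usable samples per target ends up smaller than the 201 modeled mutants once `dropna()` is applied.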


OK, that'll work! Now let's package the data so we can handle each of $k_{cat}$, $K_M$, and $k_{cat}/K_M$ in the same loop.

In [12]:
# opts = dict(plot_width=600, plot_height=400, min_border=0)
# p1 = figure(**opts)
# r1 = p1.circle([1,2,3], [4,5,6], size=20)

# p2 = figure(**opts)
# r2 = p2.circle([1,2,3], [4,5,6], size=20)

# # get a handle to update the shown cell with
# t = show(row(p1, p2), notebook_handle=True)

In [13]:
tgts = 'kcat km kcatkm'.split()

In [14]:
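runs = []
for tgt in tgts:
    # All columns except the last three (kcat, km, kcatkm) are features
    X = J.iloc[:, 0:-3]
    y = J[tgt]
    # Keep only mutants with no missing values, i.e. with a measured target
    G = X.join(y).dropna()

    # Ten columns most correlated with the target by absolute value
    # (the target itself is included and appears first)
    d = G.corr()[[tgt]]
    d['abs'] = d[tgt].abs()
    d = d.sort_values('abs', ascending=False).head(10)
    print(d)

    # Split back into features and a 1-D target array
    X = G.iloc[:, 0:-1]
    y = G.iloc[:, -1].to_numpy()
    pkg = tgt, X, y
    runs.append(pkg)
    print(tgt, X.shape, y.shape)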

                                 kcat       abs
kcat                         1.000000  1.000000
(400, fa_dun_semi)           0.472867  0.472867
(14, fa_intra_sol_xover4)    0.465426  0.465426
(14, fa_intra_rep_xover4)   -0.456668  0.456668
(436, fa_dun_dev)           -0.440420  0.440420
(381, fa_atr)               -0.440063  0.440063
(400, lk_ball_bridge_uncpl)  0.431460  0.431460
(440, fa_elec)              -0.429781  0.429781
(14, lk_ball_bridge_uncpl)   0.428562  0.428562
(399, lk_ball_bridge)        0.424380  0.424380
kcat (113, 8401) (113,)
                                   km       abs
km                           1.000000  1.000000
(119, lk_ball_bridge)        0.403788  0.403788
(123, lk_ball_bridge)        0.403737  0.403737
(169, lk_ball)               0.399888  0.399888
(133, lk_ball_iso)          -0.393501  0.393501
(133, fa_atr)               -0.387767  0.387767
(401, hxl_tors)              0.381995  0.381995
(119, fa_atr)                0.381491  0.381491
(119, lk_ball_br
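A side note on cost: `G.corr()` computes the full feature-by-feature correlation matrix even though only the column for the target is used. One possible alternative, sketched here on synthetic data (the feature names `f0` to `f4` and the target are invented for illustration), is `DataFrame.corrwith`, which correlates each column with a single series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features; the target depends strongly on f3 by construction.
X = pd.DataFrame(rng.normal(size=(50, 5)), columns=[f"f{i}" for i in range(5)])
y = pd.Series(2.0 * X["f3"] + rng.normal(scale=0.1, size=50), name="target")

# Pearson correlation of every feature column with the target only.
corr = X.corrwith(y)

# Rank features by absolute correlation, strongest first.
top = corr.abs().sort_values(ascending=False).head(3)
print(top)
```

With thousands of features and only about a hundred samples, some sizeable correlations are expected by chance alone, so a ranking like this is best read as a screening step rather than evidence that a feature is informative.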

In [15]:
pts = []
for tgt, X, y in runs:
    print(tgt, X.shape, y.shape) 
    params = {}
    
    pln = pipeline.Pipeline([
        ('scaler', preprocessing.StandardScaler()),
        ('clf', linear_model.ElasticNetCV(cv=5, max_iter=int(1e9))),  # max_iter must be an integer
    ])
    
    clf = model_selection.GridSearchCV(pln, params)
    preds = model_selection.cross_val_predict(clf, X, y, cv=4)
    
    opts = dict(plot_width=300, plot_height=300)
    plot = figure(**opts)
    plot.circle(y, preds)
    pts.append(plot)
    
show(row(pts), notebook_handle=True)    

kcat (113, 8401) (113,)
km (113, 8401) (113,)
kcatkm (113, 8401) (113,)
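The scatter plots of measured versus predicted values can be complemented with a number. The following is a minimal sketch, on synthetic data rather than the real feature table, of scoring cross-validated predictions from the same kind of pipeline with $R^2$ and Pearson's $r$. The data, the sizes, and the choice of metrics are assumptions made for illustration.

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection, pipeline, preprocessing

rng = np.random.default_rng(0)

# Synthetic data: 60 samples, 20 features, only the first two matter.
X = rng.normal(size=(60, 20))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

pln = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("clf", linear_model.ElasticNetCV(cv=5)),
])

# Each prediction comes from a model that never saw that sample;
# ElasticNetCV picks its regularization inside each training fold.
preds = model_selection.cross_val_predict(pln, X, y, cv=4)

r2 = metrics.r2_score(y, preds)
pearson_r = np.corrcoef(y, preds)[0, 1]
print(f"R^2 = {r2:.2f}, Pearson r = {pearson_r:.2f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the held-out samples leaks into the preprocessing.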


