## Machine learning analysis of 8,959 potentially informative features derived from molecular modeling* for their use in predicting experimental $k_{cat}$, $K_M$, and $k_{cat}/K_M$

### *stock enzyme design protocol

We are interested in using machine learning tools to predict enzyme function from protein structure. To do this, we first need to identify informative features in the molecular modeling feature set $F$.

In [47]:
%matplotlib inline
import matplotlib.pyplot as plt 
import seaborn as sns 

In [48]:
from sklearn import preprocessing, linear_model, ensemble, feature_selection, model_selection, pipeline 

In [49]:
import pandas 
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row, column
from bokeh.plotting import figure
output_notebook()

In [50]:
! pwd

/Users/alex/Documents/bglb_family/machine_learning


## Features! 

In [51]:
! rsync -avz ca:/share/work/alex/stock_enzyme_design_protocol/data.h5 data2.h5 

/home/carlin/.bashrc: line 14: module: command not found
/home/carlin/.bashrc: line 15: module: command not found
/home/carlin/.bashrc: line 16: module: command not found
receiving file list ... done

sent 16 bytes  received 88 bytes  208.00 bytes/sec
total size is 14344384  speedup is 137926.77


In [52]:
feat = pandas.read_hdf('data2.h5')
feat.head()

Unnamed: 0_level_0,0,0,0,0,0,0,0,0,0,0,...,447,447,447,447,447,447,447,447,447,447
Unnamed: 0_level_1,label,fa_atr,fa_rep,fa_sol,fa_intra_rep,fa_elec,pro_close,hbond_sr_bb,hbond_lr_bb,hbond_bb_sc,...,hbond_sc,dslf_fa13,rama,omega,fa_dun,p_aa_pp,yhh_planarity,ref,total,sequence_position
A192S,weights,1.0,0.55,0.9375,0.005,0.875,1.25,1.17,1.17,1.17,...,-3.03171,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.89029,446
A306N,weights,1.0,0.55,0.9375,0.005,0.875,1.25,1.17,1.17,1.17,...,-3.18737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.42077,446
A356A,weights,1.0,0.55,0.9375,0.005,0.875,1.25,1.17,1.17,1.17,...,-3.35449,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.26879,446
A357A,weights,1.0,0.55,0.9375,0.005,0.875,1.25,1.17,1.17,1.17,...,-3.33426,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.19985,446
E177K,weights,1.0,0.55,0.9375,0.005,0.875,1.25,1.17,1.17,1.17,...,-3.3662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4.37823,446


In [53]:
feat.shape

(182, 8959)

Before we get into the targets, let's get rid of as many of these features as we can. Having 8,959 features for only 182 samples is not great!

First, let's get rid of string features, since they are likely labels. The easiest way to do this is to keep only the features that are float64. Note that this also drops any integer-valued columns (`sequence_position` appears to be one).

In [54]:
feat = feat.select_dtypes(['float64'])
feat.shape

(182, 8063)

And, let's get rid of features with 0 variance. 

In [55]:
zero_variance_features = []
for col in feat.columns:
    if feat[col].std() == 0.0:
        zero_variance_features.append(col)
feat = feat.drop(zero_variance_features, axis=1)
feat.shape

(182, 5023)

Ooooh, that is good, we get rid of another 3,040 useless features this way!

Now that we have a "reasonable" set of features, let's try looking at the targets.
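The zero-variance filter above loops over columns one at a time. As a rough sketch of the same criterion written as a single vectorized mask (the toy frame and its column names are invented for illustration, not taken from `data2.h5`):

```python
import pandas as pd

# Toy stand-in for the feature table; column names and values are made up
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "varies": [0.1, 0.4, 0.2, 0.9],
    "also_constant": [0.0, 0.0, 0.0, 0.0],
})

# One standard deviation per column, then keep the columns where it is non-zero
keep = toy.std() > 0
toy_reduced = toy.loc[:, keep]

print(toy_reduced.columns.tolist())
```

On the real table this should select the same columns as the loop, since both test the per-column standard deviation against zero.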

In [56]:
# ha, let's try just the ligand: keep only columns whose first level is 1
not_lig_features = []
for col in feat.columns:
    if col[0] != 1:
        not_lig_features.append(col)
feat = feat.drop(not_lig_features, axis=1)
feat.shape

(182, 18)
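Since the columns are two-level tuples, the same filter can probably be written as a mask over the first column level instead of a loop. A minimal sketch on a toy frame (the positions, score terms, and numbers here are invented for illustration):

```python
import pandas as pd

# Toy frame with two-level columns: (position, score term)
cols = pd.MultiIndex.from_tuples([
    (0, "fa_atr"), (0, "fa_rep"),
    (1, "fa_atr"), (1, "fa_rep"),
    (2, "fa_atr"),
])
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0, 9.0, 10.0]],
    index=["A192S", "E177K"],
    columns=cols,
)

# Boolean mask over the first column level; keeps only the (1, ...) columns
mask = toy.columns.get_level_values(0) == 1
toy_lig = toy.loc[:, mask]

print(toy_lig.columns.tolist())
```

This keeps the full `(1, term)` column labels, which matches how the columns show up after the join further down.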

## Machine learning targets! 

In [57]:
targets = pandas.read_csv('/Users/alex/Documents/bglb_data_set/bglb_targets.csv', index_col=0)
targets.head()

Unnamed: 0_level_0,kcat,km,kcatkm
mutant_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
S14A,2.50515,0.916454,4.589089
T15A,2.788168,0.724276,5.063892
S16A,2.187521,1.146438,4.041274
S17A,2.928396,1.265996,4.66255
S17E,2.806858,0.864511,4.942484


Now, let's see if we can join these and get something out.

In [58]:
J = feat.join(targets)
J.head()



Unnamed: 0,"(1, fa_atr)","(1, fa_rep)","(1, fa_sol)","(1, fa_intra_rep)","(1, fa_elec)","(1, pro_close)","(1, hbond_sr_bb)","(1, hbond_lr_bb)","(1, hbond_bb_sc)","(1, hbond_sc)",...,"(1, rama)","(1, omega)","(1, fa_dun)","(1, p_aa_pp)","(1, yhh_planarity)","(1, ref)","(1, total)",kcat,km,kcatkm
A192S,-2439.23,288.86,1332.84,5.18284,-236.725,100.019,-128.01,-68.4796,-46.1582,-70.5532,...,-23.6672,39.1865,639.732,-67.317,0.09919,-14.3468,-689.705,2.975891,0.706718,5.269158
A306N,-2437.94,285.688,1331.59,5.15594,-235.467,100.044,-128.002,-68.5066,-46.9282,-70.1182,...,-23.6909,38.6848,640.445,-68.0257,0.10636,-14.3468,-692.438,,,
A356A,-2439.84,286.798,1333.0,5.15388,-234.471,99.9919,-127.809,-68.478,-45.9961,-69.9346,...,-23.6139,39.4164,639.773,-67.1738,0.06791,-14.3468,-688.593,,,
A357A,-2438.82,285.981,1332.22,5.14629,-234.779,100.028,-127.932,-68.6586,-47.1987,-69.7857,...,-23.9853,38.9222,640.27,-67.4892,0.02991,-14.3468,-691.527,,,
E177K,-2438.84,286.267,1332.18,5.1584,-235.301,100.017,-127.977,-68.5239,-46.1104,-69.8084,...,-24.1499,39.0759,640.558,-67.5385,0.0834,-14.3468,-690.389,2.744293,0.791691,4.952352
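The NaN rows above are what we would expect from an index-aligned left join: every modeled mutant is kept, and mutants without measurements get missing targets. A small sketch of that behavior, with a quick count of how many rows have each target (toy names and numbers, not the real data):

```python
import pandas as pd

# Toy features and targets; mutant names and values are made up
toy_feat = pd.DataFrame(
    {"total": [-689.7, -692.4, -690.4]},
    index=["A192S", "A306N", "E177K"],
)
toy_targets = pd.DataFrame(
    {"kcat": [2.98, 2.74], "km": [0.71, 0.79]},
    index=["A192S", "E177K"],
)

# DataFrame.join is a left join on the index by default, so every
# feature row is kept and unmatched targets become NaN
toy_J = toy_feat.join(toy_targets)

# Number of rows with a measured value for each target
n_measured = toy_J[["kcat", "km"]].notna().sum()
print(n_measured)
```

Running the same `notna().sum()` on `J[['kcat', 'km', 'kcatkm']]` would show how many samples are actually usable per target before any modeling.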


OK, that'll work! Now, let's build a list of runs so we can handle each of $k_{cat}$, $K_M$, and $k_{cat}/K_M$ in the same loop!

In [59]:
# opts = dict(plot_width=600, plot_height=400, min_border=0)
# p1 = figure(**opts)
# r1 = p1.circle([1,2,3], [4,5,6], size=20)

# p2 = figure(**opts)
# r2 = p2.circle([1,2,3], [4,5,6], size=20)

# # get a handle to update the shown cell with
# t = show(row(p1, p2), notebook_handle=True)

In [60]:
tgts = 'kcat km kcatkm'.split()

In [61]:
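runs = []
for tgt in tgts:
    # features are every column except the last three (the targets)
    X = J.iloc[:, 0:-3]
    y = J[tgt]
    # keep only mutants with a measured value for this target
    G = X.join(y).dropna()

    # ten strongest Pearson correlations with the target (the target itself ranks first)
    d = G.corr()[[tgt]]
    d['abs'] = d[tgt].map(abs)
    d = d.sort_values('abs', ascending=False).head(10)
    print(d)

    # .iloc replaces the removed .ix indexer; the target is the last column of G
    X = G.iloc[:, 0:-1]
    y = G.iloc[:, -1].to_numpy()
    pkg = tgt, X, y
    runs.append(pkg)
    print(tgt, X.shape, y.shape)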

                      kcat       abs
kcat              1.000000  1.000000
(1, hbond_lr_bb)  0.205450  0.205450
(1, fa_atr)       0.182210  0.182210
(1, hbond_bb_sc)  0.180664  0.180664
(1, hbond_sr_bb)  0.156433  0.156433
(1, total)        0.138169  0.138169
(1, fa_sol)      -0.106656  0.106656
(1, fa_dun)       0.105193  0.105193
(1, omega)       -0.090909  0.090909
(1, fa_rep)      -0.085017  0.085017
kcat (102, 18) (102,)
                          km       abs
km                  1.000000  1.000000
(1, yhh_planarity) -0.124380  0.124380
(1, p_aa_pp)       -0.106868  0.106868
(1, fa_atr)        -0.100092  0.100092
(1, hbond_sc)       0.092302  0.092302
(1, fa_rep)        -0.083199  0.083199
(1, fa_elec)        0.075598  0.075598
(1, hbond_sr_bb)   -0.071692  0.071692
(1, fa_intra_rep)   0.066588  0.066588
(1, total)         -0.065614  0.065614
km (102, 18) (102,)
                    kcatkm       abs
kcatkm            1.000000  1.000000
(1, fa_atr)       0.186198  0.186198
(1, hbond_l
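The ranking above builds the full correlation matrix and then pulls out one column. As a hedged alternative, `DataFrame.corrwith` computes only the feature-to-target correlations. A sketch on toy data (feature names `f1` and `f2` and all values are invented):

```python
import pandas as pd

# Toy table: two made-up features plus a target column
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [5.0, 3.0, 4.0, 1.0, 2.0],
    "kcat": [2.0, 4.0, 6.0, 8.0, 10.0],
})

X_toy = toy.drop(columns="kcat")

# Pearson correlation of each feature column with the target
r = X_toy.corrwith(toy["kcat"])

# Rank features by absolute correlation, strongest first
ranked = r.abs().sort_values(ascending=False)
print(ranked)
```

Unlike the `G.corr()` approach, the target does not appear in its own ranking here, so the top ten rows would all be features.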

In [62]:
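pts = []
for tgt, X, y in runs:
    print(tgt, X.shape, y.shape)
    # empty grid: GridSearchCV simply fits the pipeline with its default settings
    params = {}

    # standardize the features, then fit an elastic net with internally tuned regularization
    pln = pipeline.Pipeline([
        ('scaler', preprocessing.StandardScaler()),
        ('clf', linear_model.ElasticNetCV(cv=5)),
    ])

    clf = model_selection.GridSearchCV(pln, params)
    # out-of-fold predictions: each sample is predicted by a model that never saw it
    preds = model_selection.cross_val_predict(clf, X, y, cv=4)

    # scatter plot of measured (x) versus predicted (y) values for this target
    opts = dict(plot_width=300, plot_height=300)
    plot = figure(**opts)
    plot.circle(y, preds)
    pts.append(plot)

show(row(pts), notebook_handle=True)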

kcat (102, 18) (102,)
km (102, 18) (102,)
kcatkm (102, 18) (102,)
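The scatter plots show measured versus predicted values, but a number or two makes the comparison between targets easier. A minimal sketch of scoring out-of-fold predictions with $R^2$ and Pearson's $r$, run on synthetic data (the data, coefficients, and variable names below are assumptions for illustration, not results from this data set):

```python
import numpy as np
from sklearn import linear_model, model_selection, pipeline, preprocessing
from sklearn.metrics import r2_score

# Synthetic stand-in data: 60 samples, 5 features, linear signal plus noise
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.0, 0.5])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=60)

# Same kind of pipeline as in the cell above
pln = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("clf", linear_model.ElasticNetCV(cv=5)),
])

# Out-of-fold predictions for every sample
preds = model_selection.cross_val_predict(pln, X_toy, y_toy, cv=4)

# Two summary scores for measured versus predicted values
r2 = r2_score(y_toy, preds)
pearson_r = np.corrcoef(y_toy, preds)[0, 1]
print(f"R^2 = {r2:.3f}, Pearson r = {pearson_r:.3f}")
```

Inside the loop over `runs`, the same two lines could be applied to each `(y, preds)` pair. Given how weak the single-feature correlations above are, low scores on the real data would not be surprising.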


