## Kelley hu4d5-5 VL-E55Y vdw step error analysis

In this notebook we use a random forest model to find the most energetically influential degrees of freedom (DOF) for the VL-E55Y vdw step TI production run. Next, we compare the sampling of these DOF during TI production to a free energy profile derived from end-state GaMD sampling. We then attempt to correct any inaccurate sampling in the TI data and report the estimated ddG before and after the correction.

In [1]:
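import os
os.chdir("..")
# The star import is expected to supply pd, the scikit-learn classes used below, and benchmark_model
from common_functions import *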

### Ingesting original TI lambda production data

In [2]:
os.chdir("./TI_data/VL-E55Y")
geom_dvdls_crg = pd.read_csv("E55Y_crg_bound.csv")
geom_dvdls_crg_ub = pd.read_csv("E55Y_crg_unbound.csv")
geom_dvdls_vdw = pd.read_csv("E55Y_vdw_bound.csv")
dvdls_ub_vdw = pd.read_csv("E55Y_vdw_unbound.csv")


### Initial ddG estimate:

In [4]:
# Per-lambda mean of the weighted dV/dl, summed over lambda windows.
# Selecting the column before .mean() avoids averaging unrelated (possibly non-numeric) columns.
dG_bd_crg = geom_dvdls_crg.groupby("Lambda")["weight_dvdl"].mean().sum()
dG_ubd_crg = geom_dvdls_crg_ub.groupby("Lambda")["weight_dvdl"].mean().sum()
ddG_crg = dG_bd_crg - dG_ubd_crg

dG_bd_vdw = geom_dvdls_vdw.groupby("Lambda")["weight_dvdl"].mean().sum()
dG_ubd_vdw = dvdls_ub_vdw.groupby("Lambda")["weight_dvdl"].mean().sum()
ddG_vdw = dG_bd_vdw - dG_ubd_vdw

empirical_value = -0.18

print("Original ddG (crg step): ")
print(f"{round(ddG_crg, 4)} kcal/mol")

print("Original ddG (vdw step): ")
print(f"{round(ddG_vdw, 4)} kcal/mol")

print("Original total ddG: ")
print(f"{round(ddG_crg + ddG_vdw, 4)} kcal/mol")

print()
print("Empirical value: ")
print(f"{empirical_value} kcal/mol")

orig_error = abs((ddG_crg + ddG_vdw) - empirical_value)

print("Original ddG error: ")
print(f"{round(orig_error, 4)} kcal/mol")

Original ddG (crg step): 
-1.9403 kcal/mol
Original ddG (vdw step): 
0.5016 kcal/mol
Original total ddG: 
-1.4387 kcal/mol

Empirical value: 
-0.18 kcal/mol
Original ddG error: 
1.2587 kcal/mol
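
The estimate above sums, over the lambda windows, the per-window mean of `weight_dvdl`, then takes bound minus unbound. Here is a minimal sketch of that arithmetic on synthetic data. It assumes (as the column name suggests, though the notebook does not confirm it) that `weight_dvdl` is each dV/dl sample already multiplied by its quadrature weight. The helper name `ti_free_energy`, the lambda values, and all numbers are made up for illustration.

```python
import pandas as pd


def ti_free_energy(df: pd.DataFrame) -> float:
    """Sum the per-lambda mean of the pre-weighted dV/dl samples."""
    return float(df.groupby("Lambda")["weight_dvdl"].mean().sum())


# Synthetic, made-up samples: two lambda windows with two frames each.
bound = pd.DataFrame({
    "Lambda": [0.25, 0.25, 0.75, 0.75],
    "weight_dvdl": [1.0, 3.0, -1.0, -2.0],
})
unbound = pd.DataFrame({
    "Lambda": [0.25, 0.25, 0.75, 0.75],
    "weight_dvdl": [0.5, 1.5, 0.0, -2.0],
})

dG_bound = ti_free_energy(bound)      # mean(1, 3) + mean(-1, -2) = 2.0 - 1.5
dG_unbound = ti_free_energy(unbound)  # mean(0.5, 1.5) + mean(0, -2) = 1.0 - 1.0
ddG = dG_bound - dG_unbound           # bound leg minus unbound leg
print(f"ddG = {ddG:.4f} kcal/mol")
```

Because the weights are folded into each sample, the window means can be summed directly without a separate integration step.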


### Vdw step RF model

#### Splitting data into independent/dependent variables for random forest model

See our methods and supplemental methods sections for how we chose the input features.

In [5]:
X_5A = geom_dvdls_vdw.drop([
    "#Frame", "weight_dvdl", "dvdl", "Run", "Lambda", "F239_chi2", "D243_chi2", "Y570_D243_OD"
], axis=1)
Y = geom_dvdls_vdw["weight_dvdl"]

X_scl = pd.DataFrame(StandardScaler().fit_transform(X_5A))
X_scl.columns = X_5A.columns


#### Checking for cross-correlations within the dataset

In [6]:
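# Print every pair of distinct features whose absolute Pearson correlation exceeds 0.5.
# Each qualifying pair is printed twice, once per ordering.
absCorr = X_scl.corr().abs()
for i in absCorr.columns:
    for j in absCorr.index:
        cor = absCorr.loc[i, j]  # already an absolute value
        if cor > 0.5 and i != j:
            print(i, j)
            print(cor)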
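
# A vectorized variant of the check above that reports each feature pair once.
# This is a sketch on synthetic data; the helper name `correlated_pairs` and the
# demo columns are hypothetical and not part of `common_functions`.

```python
import numpy as np
import pandas as pd


def correlated_pairs(features: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return |Pearson r| for each unordered feature pair above `threshold`."""
    abs_corr = features.corr().abs()
    # Keep only the strict upper triangle so the diagonal and mirrored pairs drop out.
    upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    pairs = abs_corr.where(upper).stack()
    # NaN entries (masked cells) compare False here, so they are filtered out too.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic, made-up features: "b" is built to track "a"; "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})
pairs = correlated_pairs(demo)
print(pairs)
```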
            

#### Using random forest model to identify the most energetically influential degrees of freedom

We run our model 25 times, then sort the results by the mean feature importance across the 25 iterations. For this particular perturbation, the model $R^2$ between the geometric DOF (nearby side chain rotamers or interatomic distances) and the energetic dV/dλ was not strong enough (average test $R^2 \approx 0.36$ in the output below) for us to go on to check the sampling of these degrees of freedom.

In [8]:
# RFE keeps 75% of the features, removing 5% of them (rounded down) per iteration
rfeDefault = RFE(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=42),
    n_features_to_select=0.75, step=0.05,
)
rfDefault = RandomForestRegressor(
    max_depth=10, n_estimators=200, oob_score=True, max_features=0.6,
    min_samples_leaf=7, min_samples_split=14, random_state=42,
)

pipelineDefault_rf = Pipeline([
    ('feature_scaling', StandardScaler()),
    # ('pre_select', kbest),
    ('feature_selection', rfeDefault),
    ('regression_model', rfDefault)
])


imps = benchmark_model(pipelineDefault_rf, X_scl, Y, geom_dvdls_vdw["Lambda"])
imps[["Mean", "Median"]].sort_values(by="Mean", ascending=False)[:15]

Avg. training r2: 
0.4738
Training r2 std dev: 
0.0026
Avg. test r2: 
0.3585
Testing r2 std dev: 
0.0058


| Feature | Mean | Median |
|---|---|---|
| Y570_D243_O | 0.260578 | 0.260064 |
| Y570_chi2 | 0.120656 | 0.12026 |
| E410_chi1 | 0.063503 | 0.078098 |
| L409_chi2 | 0.063033 | 0.058112 |
| S411_chi1 | 0.051436 | 0.027972 |
| Y101_chi2 | 0.041995 | 0.047173 |
| E410_chi3 | 0.041208 | 0.024132 |
| E410_chi2 | 0.037764 | 0.021359 |
| L409_chi1 | 0.035676 | 0.020558 |
| Y404_chi2 | 0.028616 | 0.011968 |
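
`benchmark_model` lives in `common_functions` and is not shown in this notebook. The following is a rough sketch of what such a helper could look like, assuming it refits the model on repeated random train/test splits, reports the average $R^2$, and collects feature importances into a table with `Mean` and `Median` columns. The function name `benchmark_sketch`, the split strategy, the iteration count, and the synthetic data are all hypothetical. The sketch uses a bare random forest rather than the full pipeline, and it ignores the lambda labels that the real helper receives.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def benchmark_sketch(model, X, y, n_iter=5, test_size=0.25):
    """Hypothetical stand-in for benchmark_model: repeated random splits."""
    train_r2, test_r2, importances = [], [], []
    for seed in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        fitted = clone(model).fit(X_tr, y_tr)  # fresh, unfitted copy each iteration
        train_r2.append(r2_score(y_tr, fitted.predict(X_tr)))
        test_r2.append(r2_score(y_te, fitted.predict(X_te)))
        importances.append(fitted.feature_importances_)
    # Rows are features, columns are iterations.
    imps = pd.DataFrame(importances, columns=X.columns).T
    summary = pd.DataFrame({"Mean": imps.mean(axis=1), "Median": imps.median(axis=1)})
    print(f"Avg. training r2: {np.mean(train_r2):.4f}")
    print(f"Avg. test r2: {np.mean(test_r2):.4f}")
    return summary.sort_values(by="Mean", ascending=False)


# Synthetic, made-up data: only "x0" drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x0", "x1", "x2"])
y_demo = 2.0 * X_demo["x0"] + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=7, random_state=42)
summary = benchmark_sketch(rf, X_demo, y_demo)
print(summary)
```

Comparing the training and test $R^2$ across splits, as in the output above, indicates how much of the fit generalizes. A large gap between the two suggests the importances should be read with caution.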
