## Kelley hu4d5-5 VL-H91A error analysis

In this notebook we use a random forest model to find the most energetically influential degrees of freedom (DOF) for the VL-H91A TI production run. Next, we compare the sampling of these DOF during TI production to a free energy profile derived from end-state GaMD sampling. Finally, we attempt to correct any inaccurate sampling in the TI data and report the estimated ddG before and after the correction.

In [2]:
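import os

# Move to the parent directory before importing the shared helpers
os.chdir("..")
# The star import is expected to supply pd, the scikit-learn classes, and benchmark_model used below
from common_functions import *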

### Ingesting initial TI data

In [3]:
os.chdir("./TI_data/VL-H91A/")
geom_dvdls = pd.read_csv("H91A_bound.csv")
geom_dvdls_ub = pd.read_csv("H91A_unbound.csv")

### Original ddG estimate:

In [4]:
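# Free energy of each leg: mean weighted dV/dl per lambda window, summed over windows
orig_dG_bd = geom_dvdls.groupby("Lambda")["weight_dvdl"].mean().sum()
orig_dG_ubd = geom_dvdls_ub.groupby("Lambda")["weight_dvdl"].mean().sum()

# Empirical reference ddG (kcal/mol)
empirical_value = 3.15
# Absolute deviation of the computed ddG (bound minus unbound) from the reference
orig_error = abs((orig_dG_bd - orig_dG_ubd) - empirical_value)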

print("Original ddG estimate: ")
print(f"{round(orig_dG_bd - orig_dG_ubd, 4)} kcal/mol ")
print()
print("Original ddG error: ")
print(f"{round(orig_error, 4)} kcal/mol")

Original ddG estimate: 
3.3634 kcal/mol 

Original ddG error: 
0.2134 kcal/mol
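
The estimate above reduces to simple arithmetic: average `weight_dvdl` within each λ window, sum those averages for each leg, and subtract the unbound leg from the bound leg. The following is a minimal sketch of that arithmetic on made-up numbers. The `ti_free_energy` helper and the toy frames are hypothetical, and the sketch assumes that `weight_dvdl` already carries the quadrature weight of its λ window.

```python
import pandas as pd


def ti_free_energy(frames: pd.DataFrame) -> float:
    """Sum the per-window mean of the pre-weighted dV/dl values (hypothetical helper)."""
    return frames.groupby("Lambda")["weight_dvdl"].mean().sum()


# Toy data (not from the notebook): two lambda windows per leg, two frames per window.
bound = pd.DataFrame({
    "Lambda": [0.25, 0.25, 0.75, 0.75],
    "weight_dvdl": [1.0, 3.0, 4.0, 6.0],
})
unbound = pd.DataFrame({
    "Lambda": [0.25, 0.25, 0.75, 0.75],
    "weight_dvdl": [0.5, 1.5, 2.0, 4.0],
})

dG_bound = ti_free_energy(bound)      # window means 2.0 and 5.0
dG_unbound = ti_free_energy(unbound)  # window means 1.0 and 3.0
ddG = dG_bound - dG_unbound           # bound minus unbound
print(f"ddG = {ddG:.2f} kcal/mol")
```

Selecting the `weight_dvdl` column before averaging keeps the calculation independent of whatever other columns the CSV happens to contain.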


### Removing correlated variables among candidate degrees of freedom

We remove correlated candidate features before training to limit noise in the model. See the methods and supplemental methods sections for how the input features were chosen.

In [5]:
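# Drop bookkeeping columns, the target, and two candidate DOF (Y404_chi2, Q445_chi3)
X = geom_dvdls.drop(columns=[
    "#Frame", "weight_dvdl", "dvdl", "Run", "Lambda", "Y404_chi2", "Q445_chi3",
])

# Regression target: weighted dV/dl per frame
Y = geom_dvdls["weight_dvdl"]

# Standardize the features, keeping the original column names and row index
X_scl = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns, index=X.index)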


### Checking for cross-correlations among the independent variables

In [6]:
absCorr = X_scl.corr().abs()
for i in absCorr.columns:
    for j in absCorr.index:
        cor = absCorr.loc[i, j]
        # Report off-diagonal pairs with |r| > 0.5 (each pair prints twice, once per ordering)
        if cor > 0.5 and i != j:
            print(i, j)
            print(cor)

H446_chi1 ser185_h226
0.5874333398420051
ser185_h226 H446_chi1
0.5874333398420051
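
Because the correlation matrix is symmetric, the nested loop reports every pair twice. One way to list each pair once is to iterate over unordered column pairs. The sketch below uses a hypothetical `correlated_pairs` helper and toy data that are not part of the notebook.

```python
from itertools import combinations

import pandas as pd


def correlated_pairs(features: pd.DataFrame, threshold: float = 0.5):
    """Return (feature_a, feature_b, |r|) once for each pair with |r| above the threshold."""
    abs_corr = features.corr().abs()
    return [
        (a, b, abs_corr.loc[a, b])
        for a, b in combinations(abs_corr.columns, 2)  # unordered pairs, no diagonal
        if abs_corr.loc[a, b] > threshold
    ]


# Toy data (not from the notebook): x and y move together, z alternates in sign.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 13],
    "z": [1, -1, 1, -1, 1, -1],
})

pairs = correlated_pairs(toy)
for a, b, r in pairs:
    print(a, b, round(r, 3))
```

Applied to `X_scl`, a helper like this would report the `H446_chi1` / `ser185_h226` pair a single time.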


### Using random forest model to identify the most energetically influential degrees of freedom

We run the model 25 times, then sort the features by their mean importance across the 25 iterations. For this perturbation, the model $R^2$ between the geometric DOF (nearby side-chain rotamers or interatomic distances) and the energetic DV/DL was too weak (average test $R^2$ of about 0.34) to justify checking the sampling of these degrees of freedom.

In [11]:
rfeDefault = RFE(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=42),
    n_features_to_select=0.75,
    step=0.05,
)
rfDefault = RandomForestRegressor(
    max_depth=10,
    n_estimators=200,
    oob_score=True,
    max_features=0.6,
    min_samples_leaf=7,
    min_samples_split=14,
    random_state=42,
)

pipelineDefault_rf = Pipeline([
    ('feature_scaling', StandardScaler()),
    # ('pre_select', kbest),
    ('feature_selection', rfeDefault),
    ('regression_model', rfDefault)
])


imps = benchmark_model(pipelineDefault_rf, X_scl, Y, geom_dvdls["Lambda"])
imps[["Mean", "Median"]].sort_values(by="Mean", ascending=False)[:15]

Avg. training r2: 
0.5323
Training r2 std dev: 
0.0005
Avg. test r2: 
0.3369
Testing r2 std dev: 
0.0053


| Feature | Mean | Median |
|---|---|---|
| ser185_h226 | 0.225089 | 0.224943 |
| H446_chi2 | 0.124295 | 0.123997 |
| Y404_chi1 | 0.087022 | 0.089666 |
| T448_chi1 | 0.080071 | 0.080138 |
| V384_chi1 | 0.079857 | 0.079861 |
| Q444_chi3 | 0.075968 | 0.075987 |
| Y447_chi1 | 0.055016 | 0.054895 |
| T386_chi1 | 0.054368 | 0.054348 |
| F453_chi2 | 0.041323 | 0.041088 |
| Y240_chi2 | 0.032619 | 0.032671 |
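
`benchmark_model` is imported from `common_functions`, and its source is not shown here. The sketch below is only a guess at the general shape of such a routine, not the authors' implementation: it refits a pipeline on repeated train/test splits, records $R^2$ on each split, and averages the forest's feature importances over the repeats. The function name `repeated_importances`, the use of λ to stratify the splits, the 80/20 split, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor


def repeated_importances(pipeline, X, y, strata, n_repeats=25, test_size=0.2):
    """Hypothetical stand-in for benchmark_model: refit on repeated stratified splits."""
    rows, train_r2, test_r2 = [], [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=strata, random_state=seed
        )
        pipeline.fit(X_tr, y_tr)
        train_r2.append(pipeline.score(X_tr, y_tr))  # R^2 on the training split
        test_r2.append(pipeline.score(X_te, y_te))   # R^2 on the held-out split
        # Map the forest's importances back to the columns that survived RFE.
        kept = X.columns[pipeline.named_steps["feature_selection"].support_]
        importances = pipeline.named_steps["regression_model"].feature_importances_
        rows.append(pd.Series(importances, index=kept))
    runs = pd.DataFrame(rows)  # one row per repeat; NaN where RFE dropped a feature
    summary = pd.DataFrame({"Mean": runs.mean(), "Median": runs.median()})
    return summary, float(np.mean(train_r2)), float(np.mean(test_r2))


# Toy data (not from the notebook): only feature "a" drives the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_toy = 3.0 * X_toy["a"] + rng.normal(scale=0.1, size=200)
lambdas = pd.Series(np.tile([0.1, 0.5, 0.9, 1.0], 50))  # four made-up lambda windows

toy_pipeline = Pipeline([
    ("feature_selection", RFE(DecisionTreeRegressor(max_depth=3, random_state=0),
                              n_features_to_select=2)),
    ("regression_model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

summary, avg_train_r2, avg_test_r2 = repeated_importances(
    toy_pipeline, X_toy, y_toy, lambdas, n_repeats=3
)
print(summary.sort_values(by="Mean", ascending=False))
```

A large gap between the average training and test $R^2$, as in the output above (0.53 versus 0.34), is the kind of signal this summary is meant to expose before the importances are interpreted.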
