## Optimizing Progress Coordinate and Conformational Selection for Molecular Dynamics
### ML Final Project : Darian Yang

#### Original Proposal: 
My project idea is to take an ensemble of MD trajectories where multiple candidate progress coordinates are calculated per frame, and to then use machine learning to rank each candidate coordinate and select the best progress descriptor. Another thought is to potentially standardize the coordinates, then take a linear combination of each coordinate and optimize the weight of each. I'm not sure what ML technique will be best here, but perhaps using decision trees for ranking purposes or optimizing a target function that can approximate the quality of a single or multi-dimensional coordinate. In the test/training datasets, I will use trajectories that make it to a pre-defined target state as the labeled successful input. For the actual simulation data to be used, I am not sure which dataset to choose yet but I have a few from my research projects available to choose from (all of which are proteins or protein-ligand complexes). Ideally I will try it out for multiple systems.

#### Current Implementation:
I will start with a 1µs standard MD simulation of 2 different states. I will calculate multiple features for each simulation and then feed all of this data into a classification model. Using cpptraj, I calculated 59 features for a 1000 frame subset of both 2kod and 1a43 trajectories.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection
from sklearn import linear_model
from sklearn import ensemble
from sklearn import svm

In [15]:
# first load the feature sets from each simulation dataset
d1 = np.loadtxt("data/2kod_features.dat")
d2 = np.loadtxt("data/1a43_features.dat")

In [16]:
# build feature dataset with both simulations
features = np.vstack((d1, d2))

In [17]:
features.shape

(2000, 60)

In [83]:
feat_names = np.loadtxt("data/2kod_features.dat", comments=None, max_rows=1, dtype=str)
feat_names

array(['#Frame', 'RMS_M1_NMR', 'RMS_H9M1_NMR', 'RMS_M2_NMR',
       'RMS_H9M2_NMR', 'RMS_Heavy_NMR', 'RMS_Backbone_NMR',
       'RMS_Dimer_Int_NMR', 'RMS_Key_Int_NMR', 'RMS_M1_XTAL',
       'RMS_H9M1_XTAL', 'RMS_M2_XTAL', 'RMS_H9M2_XTAL', 'RMS_Heavy_XTAL',
       'RMS_Backbone_XTAL', 'RMS_Dimer_Int_XTAL', 'RMS_Key_Int_XTAL',
       'RMS_M1_HEX', 'RMS_H9M1_HEX', 'RMS_M2_HEX', 'RMS_H9M2_HEX',
       'RMS_Heavy_HEX', 'RMS_Backbone_HEX', 'RMS_Dimer_Int_HEX',
       'RMS_Key_Int_HEX', 'RMS_M1_PENT', 'RMS_H9M1_PENT', 'RMS_M2_PENT',
       'RMS_H9M2_PENT', 'RMS_Heavy_PENT', 'RMS_Backbone_PENT',
       'RMS_Dimer_Int_PENT', 'RMS_Key_Int_PENT', 'c2_angle',
       'helix_angle_3pt', 'o_angle_m1', 'o_angle_m2', 'RoG', 'RoG-cut',
       'Total_SASA', 'Num_Inter_Contacts[native]',
       'Num_Inter_Contacts[nonnative]', 'Num_Intra_Contacts[native]',
       'Num_Intra_Contacts[nonnative]', 'M1-E175-Oe_M2-W184-He1',
       'M2-E175-Oe_M1-W184-He1', 'M1-E175-Oe_M1-T148-HG1',
       'M2-E175-Oe_M2-T148

In [20]:
# build the binary classification dataset
# 0 for every frame from d1 and 1 for every frame from d2
classifiers = np.hstack((np.zeros(d1.shape[0]), np.ones(d2.shape[0])))

In [21]:
classifiers.shape

(2000,)

In [98]:
# first scale the feature data
scaler = StandardScaler()
# note to skip the first column since this is the frame number
feats_scaled = scaler.fit_transform(features[:,1:])

In [146]:
# split training/test
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(feats_scaled, classifiers, test_size=0.50)

In [147]:
X_train.shape

(1000, 59)

I scaled the data and split it, now I will build some basic models to try to classify which conformation the frame belongs to.

#### Scoring
To score my models, I will use the ROC AUC. The following reccommendations are from [here](https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc):

* You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities.
* You should not use it when your data is heavily imbalanced. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
* You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.

In my case, I have an equal balance of positive (1) and negative (0) cases, and I care equally about both (one protein conformation or the other). Just in case, I will implement include other scoring methods as well.

In [152]:
from sklearn import metrics

def calc_score(model, score="auc", X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):
    """
    Find a scoring metric using an sklearn model.
    
    Parameters
    ----------
    score : str
        'auc', 'f1', 'acc'
    """
    model = model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    if score == "auc":
        metric = metrics.roc_auc_score(y_test, y_pred)
        #print(np.testing.assert_array_equal(y_test, y_pred))
    elif score == "f1":
        metric = metrics.f1_score(y_test, y_pred)
    elif score == "acc":
        metric = metrics.accuracy_score(y_test, y_pred)
        
    # RF only
    if hasattr(model, "oob_score_"):
        print(f"OOB: {model.oob_score_}")
        print(f"{score}: {metric}")
        #return metric
        #return model.oob_score_, auc
        return model
    else:
        print(f"{score}: {metric}")
        #return metric
        return model

In [154]:
calc_score(linear_model.LogisticRegression(), "auc")

auc: 1.0


LogisticRegression()

In [155]:
rf = calc_score(ensemble.RandomForestClassifier())
plt.plot(rf.feature_importances_)
# top feature
feat_names[np.where(rf.feature_importances_ == np.max(rf.feature_importances_))[0][0]]

TypeError: calc_score() missing 4 required positional arguments: 'X_train', 'y_train', 'X_test', and 'y_test'

In [138]:
top = np.argpartition(rf.feature_importances_, -10)[-10:]
feat_names[top]

array(['RMS_Backbone_PENT', 'o_angle_m1', 'RMS_Dimer_Int_XTAL',
       'RMS_Backbone_XTAL', 'RMS_H9M2_XTAL', 'RMS_H9M2_PENT',
       'RMS_Dimer_Int_PENT', 'RMS_M1_PENT', 'c2_angle',
       'RMS_Dimer_Int_NMR'], dtype='<U29')

Next maybe run this 100 times and collect a weighted histogram of the resulting rankings? 

Also later, maybe take the C$\alpha$ inter monomer distance matrix and look for best coordinate?