# Investigate which CDER feature(s) are improving model performance alongside SME features
Author: Amish Mishra  
Date: June 25, 2025  
Use `cder2` kernel  
This notebook loads the feature dataframes from the `/features_dataframes` directory for training SME+CDER models for the HEEH topology. It uses a forward-selection-style approach, adding one CDER feature at a time to the SME features, to investigate which CDER feature contributes to the performance improvement over the baseline SME-only model.

I copied the `6sme-cder-ml.ipynb` notebook as a template for this one.

## Cautionary Notes
Given a protein design topology (say HEEH):

For the bubble chart used to find correlations and rank features by importance, the chart attributes were computed as follows:
- the feature importance of the CDER features was found by training a model with no train/test split (just one big training set) on the pdbs
- the feature importance of the SME features was found the same way
- the correlations between the CDER and SME features were found by analyzing the no train/test split dataset with both CDER and SME features

For the performance analysis
- a train/test split was performed **at the beginning** when selecting pdb files
- a model was trained on the training data by first using the pdbs to generate persistence diagrams (PDs), then learning CDER coordinates, and then using those coordinates to vectorize the train PDs.

**Bottom line**: This means that when asking the question "why did we see a ~3% increase in performance of the models using SME+CDER over just using SME?" we have to be careful. We cannot directly assess what happens when we only add one CDER feature to the pipeline.

Instead, in this notebook I take the no train/test split dataset with the SME+CDER features, train a random forest model with 10-fold CV on the SME features plus just one CDER feature at a time, and assess the average APS (average precision score) across the folds. This means that, technically, information from the test set of each fold has also helped in finding the CDER features used for that fold's training set.

In [3]:
# import time
# import pickle
import pandas
# import multidim
import numpy as np
# from sklearn import metrics
# from multidim.models import CDER
# from multidim.covertree import CoverTree
# from sklearn.metrics import confusion_matrix
# from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import train_test_split
import scipy.stats as sps
from sklearn import model_selection
# from IPython.display import display

## Train a model on the SME features plus one CDER feature at a time for the no train/test split

In [None]:
topology = 'HEEH'

# Load the features dataframes
X_data_file = pandas.read_csv(f'features_dataframes/sme_cder_train_X_{topology}_no_train_test_split.csv')
y_train = pandas.read_csv(f'features_dataframes/sme_cder_train_ylabels_{topology}_no_train_test_split.csv')

perf_dict = {}
for col in X_data_file.columns[:10]:
    print(f'{col} only included with SME features')
    # Keep this one CDER column plus the SME features (assumed to be the last 109 columns)
    X_train = X_data_file.loc[:, [col] + list(X_data_file.columns[-109:])]


    # =================== RF =========================
    # perform randomized search over rf hyperparameters

    # relabel classes from CDER colors to binary labels
    bin_labels_train = np.array([1 if label == 'green' else 0 for label in y_train['true_label']])

    # Reduced from the original settings (n_estimators=1000, n_iter=100) to n_estimators=500 and n_iter=10
    rf_clf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    rf_param_grid = {'max_features': sps.uniform, 
                        'max_depth': max_depth,
                        'min_samples_split': [2, 5, 10, 20, 30],
                        'min_samples_leaf': [1, 2, 4, 6, 8, 10]}

    rf_clf_gs = model_selection.RandomizedSearchCV(rf_clf, 
                                                    rf_param_grid,
                                                    scoring='average_precision', 
                                                    n_iter = 10,
                                                    cv=10, 
                                                    n_jobs=-1,
                                                    verbose=1)

    rf_clf_gs.fit(X_train, bin_labels_train)

    print('-- RF best params --')
    print(rf_clf_gs.best_params_)

    print('-- RF best APS --')
    print(rf_clf_gs.best_score_)

    # Save the performance metrics
    perf_dict[col] = {'rf_best_params': rf_clf_gs.best_params_,
                      'rf_best_aps': rf_clf_gs.best_score_}
    # classifier_path = f'classifiers/cder_{topology}_rf_clf_gs_no_train_test_split.pickle'
    # with open(classifier_path, 'wb') as f:
    #     pickle.dump(rf_clf_gs, f, protocol=pickle.HIGHEST_PROTOCOL)  

H_0_green[-1.1] only included with SME features
Fitting 10 folds for each of 10 candidates, totalling 100 fits
-- RF best params --
{'max_depth': 80, 'max_features': 0.21395444988101142, 'min_samples_leaf': 10, 'min_samples_split': 20}
-- RF best APS --
0.8781952867927749
H_0_red[-1.24] only included with SME features
Fitting 10 folds for each of 10 candidates, totalling 100 fits
-- RF best params --
{'max_depth': 60, 'max_features': 0.046387077770056906, 'min_samples_leaf': 10, 'min_samples_split': 10}
-- RF best APS --
0.8579052365691299
H_0_green[-0.94] only included with SME features
Fitting 10 folds for each of 10 candidates, totalling 100 fits
-- RF best params --
{'max_depth': 40, 'max_features': 0.22332512638773205, 'min_samples_leaf': 10, 'min_samples_split': 2}
-- RF best APS --
0.8665205640234663
H_0_green[-1.21] only included with SME features
Fitting 10 folds for each of 10 candidates, totalling 100 fits
-- RF best params --
{'max_depth': 80, 'max_features': 0.416121896639

In [30]:
# Show which key in the dictionary has the best APS
best_aps = max(perf_dict.items(), key=lambda x: x[1]['rf_best_aps'])
print(f'Best APS: {best_aps[1]["rf_best_aps"]} for {best_aps[0]} feature set')

Best APS: 0.8781952867927749 for H_0_green[-1.1] feature set
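
The cell above reports only the single best feature. To see every single-feature run side by side, the scores in `perf_dict` can be tabulated and sorted. The following is a minimal sketch, not part of the original notebook: `rank_features_by_aps` and its `baseline_aps` argument are hypothetical names, and the example dictionary holds only the three complete scores printed above.

```python
import pandas


def rank_features_by_aps(perf_dict, baseline_aps=None):
    """Tabulate the best CV APS per added CDER feature, highest first.

    baseline_aps is an optional, user-supplied SME-only score (hypothetical
    argument; the notebook does not store this value in perf_dict).
    """
    table = pandas.DataFrame({
        'feature': list(perf_dict.keys()),
        'rf_best_aps': [v['rf_best_aps'] for v in perf_dict.values()],
    })
    table = table.sort_values('rf_best_aps', ascending=False).reset_index(drop=True)
    if baseline_aps is not None:
        # Positive values mean the added CDER feature scored above the baseline
        table['delta_vs_baseline'] = table['rf_best_aps'] - baseline_aps
    return table


# Example with the three complete scores from the output above
example_perf = {
    'H_0_green[-1.1]': {'rf_best_aps': 0.8781952867927749},
    'H_0_red[-1.24]': {'rf_best_aps': 0.8579052365691299},
    'H_0_green[-0.94]': {'rf_best_aps': 0.8665205640234663},
}
ranking = rank_features_by_aps(example_perf)
print(ranking)
```

Because each search above draws its own random hyperparameter candidates and sets no `random_state`, small gaps between features in such a table may partly reflect search noise rather than a real difference between the features.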


## Baseline SME-only model performance

In [None]:
topology = 'HEEH'

# Drop the CDER columns (names starting with 'H_') so that only the SME features remain
X_train = X_data_file.drop(X_data_file.filter(regex=r'^H_').columns, axis=1)

# =================== RF =========================
# perform randomized search over rf hyperparameters

# relabel classes from CDER colors to binary labels
bin_labels_train = np.array([1 if label == 'green' else 0 for label in y_train['true_label']])

# Reduced from the original settings (n_estimators=1000, n_iter=100) to n_estimators=500 and n_iter=10
rf_clf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
rf_param_grid = {'max_features': sps.uniform, 
                    'max_depth': max_depth,
                    'min_samples_split': [2, 5, 10, 20, 30],
                    'min_samples_leaf': [1, 2, 4, 6, 8, 10]}

rf_clf_gs = model_selection.RandomizedSearchCV(rf_clf, 
                                                rf_param_grid,
                                                scoring='average_precision', 
                                                n_iter = 10,
                                                cv=10, 
                                                n_jobs=-1,
                                                verbose=1)

rf_clf_gs.fit(X_train, bin_labels_train)

print('-- RF best params --')
print(rf_clf_gs.best_params_)

print('-- RF best APS --')
print(rf_clf_gs.best_score_) 

Fitting 10 folds for each of 10 candidates, totalling 100 fits


## SME+CDER features model performance

In [None]:
topology = 'HEEH'

X_train = X_data_file

# =================== RF =========================
# perform randomized search over rf hyperparameters

# relabel classes from CDER colors to binary labels
bin_labels_train = np.array([1 if label == 'green' else 0 for label in y_train['true_label']])

# Reduced from the original settings (n_estimators=1000, n_iter=100) to n_estimators=500 and n_iter=10
rf_clf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
rf_param_grid = {'max_features': sps.uniform, 
                    'max_depth': max_depth,
                    'min_samples_split': [2, 5, 10, 20, 30],
                    'min_samples_leaf': [1, 2, 4, 6, 8, 10]}

rf_clf_gs = model_selection.RandomizedSearchCV(rf_clf, 
                                                rf_param_grid,
                                                scoring='average_precision', 
                                                n_iter = 10,
                                                cv=10, 
                                                n_jobs=-1,
                                                verbose=1)

rf_clf_gs.fit(X_train, bin_labels_train)

print('-- RF best params --')
print(rf_clf_gs.best_params_)

print('-- RF best APS --')
print(rf_clf_gs.best_score_) 
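
The three search cells in this notebook repeat the same random forest setup, so one way to keep them consistent is to factor that setup into a helper. The sketch below is an illustration under stated assumptions rather than the notebook's actual code: `run_rf_search` and its parameters are hypothetical names, `random_state` is an addition for repeatability, and synthetic data from `make_classification` stands in for the feature dataframes. The example call uses much smaller settings than the notebook so that it runs quickly.

```python
import numpy as np
import scipy.stats as sps
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def run_rf_search(X, y, n_estimators=500, n_iter=10, cv=10,
                  n_jobs=-1, random_state=None):
    """Randomized hyperparameter search for a random forest, scored by APS."""
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)] + [None]
    param_distributions = {
        # Fraction of features tried at each split, drawn uniformly from [0, 1)
        'max_features': sps.uniform(loc=0, scale=1),
        'max_depth': max_depth,
        'min_samples_split': [2, 5, 10, 20, 30],
        'min_samples_leaf': [1, 2, 4, 6, 8, 10],
    }
    rf_clf = RandomForestClassifier(n_estimators=n_estimators,
                                    class_weight='balanced',
                                    random_state=random_state)
    search = RandomizedSearchCV(rf_clf,
                                param_distributions,
                                scoring='average_precision',
                                n_iter=n_iter,
                                cv=cv,
                                n_jobs=n_jobs,
                                random_state=random_state)
    search.fit(X, y)
    return search


# Synthetic stand-in data (assumption: not the HEEH feature dataframes)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)
demo_search = run_rf_search(X_demo, y_demo, n_estimators=20, n_iter=2,
                            cv=3, n_jobs=1, random_state=0)
print(demo_search.best_params_)
print(demo_search.best_score_)
```

With a helper of this kind, the single-feature loop, the SME-only baseline, and the SME+CDER run would differ only in the `X` passed in, and fixing `random_state` would make their scores easier to compare.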