# Investigate which CDER feature(s) are improving model performance alongside SME features
Author: Amish Mishra  
Date: June 25, 2025  
Use `cder2` kernel  
This notebook loads in the feature dataframes from the `/features_dataframes` directory for training SME+CDER models for the HEEH topology. This notebook investigates using a forward-selection method of which CDER feature is contributing to model performance improvement over the baseline SME features model.

I copied the `6sme-cder-ml.ipynb` notebook as a template for this one.

## Cautionary Notes
Given a protein design topology (say HEEH):

For the bubble-chart for finding correlations and ranking by feature importance, the attributes of that chart were computed as follows
- the feature importance of the CDER features was found by training a model with no train/test split (just one big training set) on the pdbs
- the feature importance of the SME features was found the same way
- the correlations between the CDER and SME were found by analyzing the no train/test split dataset with both CDER and SME features

For the performance analysis
- a train/test split was performed **at the beginning** when selecting pdb files
- a model was trained on the training data by first using the pdbs to generate persistence diagrams (PDs), learn CDER coordinates, and then use the coordinates to vectorize the train PDs.

**Bottom line**: This means that when asking the question "why did we see a ~3% increase in performance of the models using SME+CDER over just using SME?" we have to be careful. We cannot directly assess what happens when we only add one CDER feature to the pipeline.

Instead, what I do in this notebook is take the no train/test split dataset with the SME+CDER features and train a 10-fold CV random forest model with just one CDER feature at a time and assess the average APS across the folds. This means that technically information from the test sets of each fold have helped in finding the CDER features for the training set as well.

In [1]:
# import time
# import pickle
import pandas
# import multidim
import numpy as np
# from sklearn import metrics
# from multidim.models import CDER
# from multidim.covertree import CoverTree
# from sklearn.metrics import confusion_matrix
# from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import train_test_split
import scipy.stats as sps
from sklearn import model_selection
# from IPython.display import display

## Train 10 models, each using the SME features + one of the CDER features.
At the end, report which CDER feature yielded the highest APS score and what the score was.

In [10]:
topology = 'HEEH'

# Save the features dataframes
X_data_file = pandas.read_csv(f'features_dataframes/sme_cder_train_X_{topology}_no_train_test_split.csv')
y_train = pandas.read_csv(f'features_dataframes/sme_cder_train_ylabels_{topology}_no_train_test_split.csv')

perf_dict = {}
aps_df = pandas.DataFrame({'cv_fold': range(1, 6)})
for col in X_data_file.columns[:10]:
    print(f'{col} only included with SME features')
    X_train = X_data_file.loc[:, [col] + list(X_data_file.columns[-109:])]


    # =================== RF =========================
    # perform randomized search over rf hyperparameters

    # relabel classes from CDER colors to binary labels
    bin_labels_train = np.array([1 if label == 'green' else 0 for label in y_train['true_label']])

    # Changed original n_estimators 1000 to 500
    rf_clf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    rf_param_grid = {'max_features': sps.uniform, 
                        'max_depth': max_depth,
                        'min_samples_split': [2, 5, 10, 20, 30],
                        'min_samples_leaf': [1, 2, 4, 6, 8, 10]}

    rf_clf_gs = model_selection.RandomizedSearchCV(rf_clf, 
                                                    rf_param_grid,
                                                    scoring='average_precision', 
                                                    n_iter = 100,
                                                    cv=5, 
                                                    n_jobs=-1,
                                                    verbose=1,
                                                    random_state=123)

    rf_clf_gs.fit(X_train, bin_labels_train)

    print('-- RF best params --')
    print(rf_clf_gs.best_params_)

    print('-- RF best APS --')
    print(rf_clf_gs.best_score_)

    # Save the performance metrics
    perf_dict[col] = {'rf_best_params': rf_clf_gs.best_params_,
                      'rf_best_aps': rf_clf_gs.best_score_}
    
    # Save the performance metrics to a CSV file
    # Create a DataFrame with the APS scores from each CV split
    aps_scores = [rf_clf_gs.cv_results_[f'split{i}_test_score'][0] for i in range(rf_clf_gs.n_splits_)]
    aps_df[f'{col}_aps'] = aps_scores

aps_df.to_csv(f'perf_dataframes/{topology}_cv_aps_scores_sme_with_one_cder_feature.csv', index=False)
aps_df


H_0_green[-1.1] only included with SME features
Fitting 5 folds for each of 100 candidates, totalling 500 fits
-- RF best params --
{'max_depth': 110, 'max_features': 0.23247978560423577, 'min_samples_leaf': 10, 'min_samples_split': 20}
-- RF best APS --
0.8803638501944928
H_0_red[-1.24] only included with SME features
Fitting 5 folds for each of 100 candidates, totalling 500 fits
-- RF best params --
{'max_depth': 110, 'max_features': 0.23247978560423577, 'min_samples_leaf': 10, 'min_samples_split': 20}
-- RF best APS --
0.8743686467395355
H_0_green[-0.94] only included with SME features
Fitting 5 folds for each of 100 candidates, totalling 500 fits
-- RF best params --
{'max_depth': 80, 'max_features': 0.058938155611387155, 'min_samples_leaf': 6, 'min_samples_split': 5}
-- RF best APS --
0.8775784941826696
H_0_green[-1.21] only included with SME features
Fitting 5 folds for each of 100 candidates, totalling 500 fits
-- RF best params --
{'max_depth': 10, 'max_features': 0.11720683786

Unnamed: 0,cv_fold,H_0_green[-1.1]_aps,H_0_red[-1.24]_aps,H_0_green[-0.94]_aps,H_0_green[-1.21]_aps,H_1_green[0.69 0.24]_aps,H_1_red[-0.41 2.95]_aps,H_1_red[27.75 0.04]_aps,H_1_green[11.89 0.13]_aps,H_2_green[1.53 0.08]_aps,H_2_red[11.09 0.11]_aps
0,1,0.959042,0.953241,0.953937,0.95644,0.958214,0.955089,0.9557,0.954871,0.953594,0.958404
1,2,0.76597,0.736091,0.751592,0.789819,0.753344,0.749776,0.755236,0.754573,0.748289,0.762552
2,3,0.796172,0.780172,0.779562,0.785734,0.793331,0.800281,0.781461,0.751065,0.778811,0.770692
3,4,0.817373,0.810226,0.823846,0.796061,0.827569,0.83738,0.837688,0.811379,0.824494,0.824204
4,5,0.911944,0.914937,0.907738,0.909684,0.917921,0.916072,0.91296,0.898806,0.905078,0.909684


In [11]:
# Show which key in the dictionary has the best APS
best_aps = max(perf_dict.items(), key=lambda x: x[1]['rf_best_aps'])
print(f'Best APS: {best_aps[1]["rf_best_aps"]} for {best_aps[0]} feature set')

Best APS: 0.8803638501944928 for H_0_green[-1.1] feature set


## Baseline SME-only model performance

In [12]:
topology = 'HEEH'

# Remove only the 'H_' columns from the dataframe
X_train = X_data_file.drop(X_data_file.filter(regex=r'^H_').columns, axis=1)

# =================== RF =========================
# perform randomized search over rf hyperparameters

# relabel classes from CDER colors to binary labels
bin_labels_train = np.array([1 if label == 'green' else 0 for label in y_train['true_label']])

# Changed original n_estimators 1000 to 500
rf_clf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
rf_param_grid = {'max_features': sps.uniform, 
                    'max_depth': max_depth,
                    'min_samples_split': [2, 5, 10, 20, 30],
                    'min_samples_leaf': [1, 2, 4, 6, 8, 10]}

rf_clf_gs = model_selection.RandomizedSearchCV(rf_clf, 
                                                rf_param_grid,
                                                scoring='average_precision', 
                                                n_iter = 100,
                                                cv=5, 
                                                n_jobs=-1,
                                                verbose=1,
                                                random_state=123)

rf_clf_gs.fit(X_train, bin_labels_train)

print('-- RF best params --')
print(rf_clf_gs.best_params_)

print('-- RF best APS --')
print(rf_clf_gs.best_score_) 

# Save the performance metrics to a CSV file
# Create a DataFrame with the APS scores from each CV split
aps_scores = [rf_clf_gs.cv_results_[f'split{i}_test_score'][0] for i in range(rf_clf_gs.n_splits_)]
aps_df = pandas.DataFrame({'cv_fold': range(1, 6), 'all_sme_aps': aps_scores})
aps_df.to_csv(f'perf_dataframes/{topology}_cv_aps_scores_sme_only.csv', index=False)
aps_df

Fitting 5 folds for each of 100 candidates, totalling 500 fits
-- RF best params --
{'max_depth': 10, 'max_features': 0.11720683786950581, 'min_samples_leaf': 10, 'min_samples_split': 5}
-- RF best APS --
0.8719865215563611


Unnamed: 0,cv_fold,all_sme_aps
0,1,0.959122
1,2,0.748107
2,3,0.784201
3,4,0.797532
4,5,0.917209


## SME+CDER features model performance
Check what the best model performance is possible with SME+CDER for the no trian/test split data

In [13]:
topology = 'HEEH'

X_train = X_data_file

# =================== RF =========================
# perform randomized search over rf hyperparameters

# relabel classes from CDER colors to binary labels
bin_labels_train = np.array([1 if label == 'green' else 0 for label in y_train['true_label']])

# Changed original n_estimators 1000 to 500
rf_clf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
rf_param_grid = {'max_features': sps.uniform, 
                    'max_depth': max_depth,
                    'min_samples_split': [2, 5, 10, 20, 30],
                    'min_samples_leaf': [1, 2, 4, 6, 8, 10]}

rf_clf_gs = model_selection.RandomizedSearchCV(rf_clf, 
                                                rf_param_grid,
                                                scoring='average_precision', 
                                                n_iter = 100,
                                                cv=5, 
                                                n_jobs=-1,
                                                verbose=1,
                                                random_state=123)

rf_clf_gs.fit(X_train, bin_labels_train)

print('-- RF best params --')
print(rf_clf_gs.best_params_)

print('-- RF best APS --')
print(rf_clf_gs.best_score_) 

# Save the performance metrics to a CSV file
# Create a DataFrame with the APS scores from each CV split
aps_scores = [rf_clf_gs.cv_results_[f'split{i}_test_score'][0] for i in range(rf_clf_gs.n_splits_)]
aps_df = pandas.DataFrame({'cv_fold': range(1, 6), 'sme_cder_aps': aps_scores})
aps_df.to_csv(f'perf_dataframes/{topology}_cv_aps_scores_sme_and_all_cder.csv', index=False)
aps_df

Fitting 5 folds for each of 100 candidates, totalling 500 fits
-- RF best params --
{'max_depth': 20, 'max_features': 0.28270293433269855, 'min_samples_leaf': 10, 'min_samples_split': 5}
-- RF best APS --
0.8847437444087068


Unnamed: 0,cv_fold,sme_cder_aps
0,1,0.963489
1,2,0.767416
2,3,0.779971
3,4,0.82618
4,5,0.90763
