To demonstrate the ``TreeModel().eval()`` method, we load the ``DOM_GSEC`` example dataset and its corresponding feature set (see [Breimann25a]_):

In [1]:
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create two feature selections with the ``.fit()`` method of the ``TreeModel`` class, one using all features and one restricted to the top 20 features via the ``is_preselected`` parameter:

In [2]:
import numpy as np

tm = aa.TreeModel()
is_selected = tm.fit(X=X, labels=labels).is_selected_

# Pre-selected from top 20
is_preselected_top20 = np.asarray(df_feat.index < 20)
tm = aa.TreeModel(is_preselected=is_preselected_top20)
is_selected_top20 = tm.fit(X=X, labels=labels).is_selected_
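
To illustrate what a boolean pre-selection mask looks like, the following NumPy-only sketch builds a top-20 mask in the same way as ``is_preselected_top20`` and applies it to a random matrix. The matrix ``X_demo`` and its shape are invented for illustration, and the column indexing only shows the effect of such a mask; it is not a description of how ``TreeModel`` handles the mask internally.

```python
import numpy as np

# Hypothetical stand-in for the feature matrix (50 samples, 100 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 100))

# Same construction as above, with a plain range in place of df_feat.index
is_preselected_demo = np.arange(X_demo.shape[1]) < 20

# Boolean indexing on the column axis keeps only the pre-selected features
X_top20 = X_demo[:, is_preselected_demo]
print(is_preselected_demo.sum(), X_top20.shape)
```

The mask must have one entry per feature column, so its length has to match ``X.shape[1]``.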

To evaluate different feature selections, provide ``X``, ``labels``, and the feature selections as a list of 2D boolean arrays via the ``list_is_selected`` parameter:

In [3]:
list_is_selected = [is_selected, is_selected_top20]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected)
aa.display_df(df_eval)

,name,accuracy,precision,recall,f1
1,Set 1,0.8635,0.8463,0.895,0.865
2,Set 2,0.8149,0.8145,0.8333,0.822
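
The idea behind such a comparison can be sketched with NumPy alone: each row of a 2D boolean array selects a subset of columns, a classifier is scored by cross-validation on that subset, and the scores are averaged over rows. This is a simplified illustration under stated assumptions, not the ``TreeModel`` implementation: it swaps in a nearest-centroid classifier for the tree-based models, and ``cv_accuracy``, ``evaluate_selection``, and the synthetic data are hypothetical names and values.

```python
import numpy as np

def cv_accuracy(X, y, n_cv=5, seed=0):
    """Mean accuracy of a nearest-centroid classifier over n_cv shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_cv)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_train, y_train = X[train_idx], y[train_idx]
        # One centroid per class, computed on the training folds only
        classes = np.unique(y_train)
        centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
        # Assign each test sample to the class of its nearest centroid
        dist = np.linalg.norm(X[test_idx][:, None, :] - centroids[None, :, :], axis=2)
        y_pred = classes[dist.argmin(axis=1)]
        scores.append(np.mean(y_pred == y[test_idx]))
    return float(np.mean(scores))

def evaluate_selection(X, y, is_selected_2d, n_cv=5):
    """Average the cross-validated accuracy over all rows of a 2D boolean mask."""
    return float(np.mean([cv_accuracy(X[:, row], y, n_cv=n_cv) for row in is_selected_2d]))

# Synthetic data: only the first five features separate the two classes
rng = np.random.default_rng(42)
y_demo = np.repeat([0, 1], 50)
X_demo = rng.normal(size=(100, 20))
X_demo[y_demo == 1, :5] += 3.0

# Two selections, each a 2D boolean array with two rows
sel_informative = np.zeros((2, 20), dtype=bool)
sel_informative[:, :5] = True
sel_noise = np.zeros((2, 20), dtype=bool)
sel_noise[:, 5:10] = True

acc_informative = evaluate_selection(X_demo, y_demo, sel_informative)
acc_noise = evaluate_selection(X_demo, y_demo, sel_noise)
print(acc_informative, acc_noise)
```

The selection containing the informative features should score clearly higher than the noise-only selection, which is the kind of difference the evaluation table is meant to reveal.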


You can also use 1D boolean masks by setting ``convert_1d_to_2d=True``. To demonstrate this, we create three boolean masks based on different scale categories:

In [4]:
mask_volume = np.asarray(df_feat["category"] == "ASA/Volume")
mask_conformation = np.asarray(df_feat["category"] == "Conformation")
mask_energy = np.asarray(df_feat["category"] == "Energy")

list_is_selected = [mask_volume, mask_conformation, mask_energy]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected, convert_1d_to_2d=True)
aa.display_df(df_eval)

,name,accuracy,precision,recall,f1
1,Set 1,0.8138,0.8395,0.8179,0.8219
2,Set 2,0.8385,0.8547,0.8673,0.8403
3,Set 3,0.8257,0.8226,0.8667,0.8261
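
One plausible reading of ``convert_1d_to_2d=True`` is that each 1D mask is treated as a 2D array with a single row. The following NumPy sketch shows that shape change on made-up category labels; it illustrates the conversion only and is an assumption about the behavior, not the library's internal code.

```python
import numpy as np

# Hypothetical category labels for six features
categories = np.array(["Energy", "Conformation", "Energy",
                       "ASA/Volume", "Energy", "Conformation"])

# 1D boolean mask, one entry per feature
mask_1d = categories == "Energy"

# Promote to 2D: a single row with the same entries
mask_2d = np.atleast_2d(mask_1d)
print(mask_1d.shape, mask_2d.shape)
```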


Provide the names of the feature selections using the ``names_feature_selections`` parameter:

In [5]:
names_feature_selections = ["ASA/Volume", "Conformation", "Energy"]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected,
                  convert_1d_to_2d=True,
                  names_feature_selections=names_feature_selections)
aa.display_df(df_eval)

,name,accuracy,precision,recall,f1
1,ASA/Volume,0.834,0.8216,0.8026,0.8208
2,Conformation,0.8383,0.8534,0.9064,0.8547
3,Energy,0.8297,0.8394,0.834,0.8209


The evaluation strategy can be adjusted by changing the number of cross-validation folds (``n_cv``, default=5) and the scoring metrics via the ``list_metrics`` parameter (default=``["accuracy", "recall", "precision", "f1"]``):

In [6]:
list_metrics = ["balanced_accuracy", "roc_auc"]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected,
                  convert_1d_to_2d=True, list_metrics=list_metrics)
aa.display_df(df_eval)

,name,balanced_accuracy,roc_auc
1,Set 1,0.8349,0.8867
2,Set 2,0.8401,0.9544
3,Set 3,0.8321,0.9118
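
The choice of metric matters most when classes are imbalanced. As a minimal sketch, the following computes plain accuracy and balanced accuracy by hand, using the standard definition of balanced accuracy as the mean of the per-class recalls. The labels and predictions are invented for illustration.

```python
import numpy as np

# Made-up imbalanced labels: eight positives, two negatives
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Accuracy: fraction of correct predictions (9 of 10)
accuracy = np.mean(y_true == y_pred)

# Balanced accuracy: recall per class (0.5 for class 0, 1.0 for class 1), then the mean
recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
balanced_accuracy = np.mean(recalls)

print(accuracy, balanced_accuracy)  # 0.9 0.75
```

Here a single error on the minority class barely affects accuracy but lowers balanced accuracy noticeably.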
