To demonstrate the ``TreeModel().fit()`` method, we load the ``DOM_GSEC`` example dataset and its corresponding feature set (see [Breimann25a]_):

In [8]:
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(10)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create a ``TreeModel`` object and fit it. The importance of each feature and its standard deviation are then available via the ``feat_importance`` and ``feat_importance_std`` attributes:

In [9]:
tm = aa.TreeModel()
tm.fit(X, labels=labels)

feat_importance = tm.feat_importance
feat_importance_std = tm.feat_importance_std

print("Feature importance: ", feat_importance)
print("Their STD: ", feat_importance_std)

Feature importance:  [ 4.682  9.968  9.458 16.027  7.644  4.682  6.62   9.335 11.505 20.08 ]
Their STD:  [0.166 0.446 0.246 0.12  0.288 0.397 0.279 0.583 0.687 0.597]


To obtain Monte Carlo estimates of the feature importance, the ``TreeModel().fit()`` method performs several rounds of model fitting and averages the feature importance across all rounds. The number of rounds can be adjusted using the ``n_rounds`` parameter (default=5):

In [10]:
tm = aa.TreeModel()
tm.fit(X, labels=labels, n_rounds=1)

feat_importance = tm.feat_importance
feat_importance_std = tm.feat_importance_std

print("Feature importance: ", feat_importance)
print("Their STD: ", feat_importance_std)

Feature importance:  [ 5.179  9.541  8.865 16.063  7.606  4.709  6.181  9.568 11.244 21.044]
Their STD:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
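The averaging across rounds can be illustrated with a minimal, library-independent sketch. The per-round values are made up, ``aggregate_rounds`` is a hypothetical helper, and the use of the population standard deviation is an assumption (it is consistent with the all-zero STD for ``n_rounds=1`` above, but it is not taken from the ``TreeModel`` source):

```python
from statistics import mean, pstdev

def aggregate_rounds(importance_per_round):
    """Average per-feature importance over rounds and report the spread."""
    # Transpose: one tuple of per-round values for each feature
    per_feature = list(zip(*importance_per_round))
    avg = [round(mean(vals), 3) for vals in per_feature]
    std = [round(pstdev(vals), 3) for vals in per_feature]
    return avg, std

# Hypothetical importance values (in %) for 3 features over 2 rounds
rounds = [[10.0, 30.0, 60.0],
          [14.0, 26.0, 60.0]]
avg, std = aggregate_rounds(rounds)
print("Mean:", avg)
print("STD: ", std)

# With a single round there is no spread to estimate
avg_one, std_one = aggregate_rounds(rounds[:1])
print("STD (one round):", std_one)
```

More rounds give a more stable average at the cost of proportionally more model fits.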


In addition, ``TreeModel().fit()`` applies a recursive feature elimination (RFE) algorithm, which can be disabled by setting ``use_rfe=False``:

In [11]:
tm.fit(X, labels=labels, use_rfe=False)
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 4.749  9.527  9.617 16.166  7.799  4.576  7.132  9.225 11.042 20.168]


The number of features selected per round is controlled by the ``n_feat_min`` and ``n_feat_max`` parameters:

In [12]:
tm.fit(X, labels=labels, n_feat_min=1, n_feat_max=3)
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 0.     0.     5.676 43.966  5.15   0.     0.     0.     0.    45.207]
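In the output above, several features have an importance of 0 and the remaining values still sum to about 100. One way such a full-length vector could arise within a single round is sketched below. This is an assumption based only on the printed values: ``expand_importance`` and its inputs are hypothetical and not part of ``aaanalysis``.

```python
def expand_importance(selected, importance_selected):
    """Map importances of the selected features back to all features (in %)."""
    total = sum(importance_selected)
    values = iter(importance_selected)
    # Unselected features keep 0; selected ones are rescaled to sum to 100
    return [round(100 * next(values) / total, 3) if keep else 0.0
            for keep in selected]

# Hypothetical selection mask for 5 features, 2 of which were kept
selected = [False, True, False, False, True]
full = expand_importance(selected, [1.0, 3.0])
print(full)
```

Averaging such vectors over several rounds would leave a feature at exactly 0 only if it was never selected in any round.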


The performance measure for the evaluation during each RFE iteration can be set by the ``metric`` parameter (default=``accuracy``):

In [13]:
tm.fit(X, labels=labels, metric="recall")
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 4.838  9.73   9.652 16.553  7.613  4.456  6.76   9.181 11.113 20.105]
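To see why the choice of ``metric`` can matter, the following sketch computes accuracy and recall by hand for a small, made-up set of binary predictions. It is independent of ``aaanalysis``, and the helper names are illustrative only:

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    n_pos = sum(t == positive for t in y_true)
    return tp / n_pos

# Made-up labels: 4 positives, 6 negatives; the classifier misses 3 positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print("Accuracy:", acc)
print("Recall:  ", rec)
```

A feature subset that looks acceptable under accuracy can score poorly under recall when positives are missed, so the two metrics may favor different subsets during elimination.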


The number of features eliminated in each step is controlled by the ``step`` parameter (default=1). Setting it to ``None`` removes all features with the lowest importance in each iteration, which is faster but less precise:

In [14]:
tm.fit(X, labels=labels, step=None)
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 4.623  9.636  9.454 16.743  7.645  4.761  6.754  9.522 11.383 19.48 ]
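The effect of ``step`` can be illustrated with a generic RFE loop. This is a minimal sketch, not the ``TreeModel`` implementation: ``rfe`` and ``mean_gap`` are hypothetical helpers, the class-mean gap is a simple stand-in for a tree-based importance, and reading ``step=None`` as "drop every feature tied for the lowest score" is an assumption:

```python
def rfe(X, y, importance_fn, n_feat_min=1, step=1):
    """Drop the least important features until n_feat_min remain.

    Returns the indices of the surviving features and the number of iterations.
    """
    remaining = list(range(len(X[0])))
    n_iter = 0
    while len(remaining) > n_feat_min:
        # Re-score only the features that are still in play
        X_sub = [[row[j] for j in remaining] for row in X]
        scores = importance_fn(X_sub, y)
        if step is None:
            # Assumed reading of step=None: drop all features tied for the lowest score
            n_drop = scores.count(min(scores))
        else:
            n_drop = step
        # Never go below n_feat_min
        n_drop = min(n_drop, len(remaining) - n_feat_min)
        order = sorted(range(len(remaining)), key=lambda i: scores[i])
        dropped = set(order[:n_drop])
        remaining = [f for i, f in enumerate(remaining) if i not in dropped]
        n_iter += 1
    return remaining, n_iter

def mean_gap(X, y):
    """Stand-in importance: absolute difference of the two class means per feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

# Toy data: features 0 and 2 are constant, features 1 and 3 separate the classes
X = [[0.0, 1.0, 0.5, 2.0],
     [0.0, 1.0, 0.5, 2.0],
     [0.0, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.0]]
y = [1, 1, 0, 0]
result_step1 = rfe(X, y, mean_gap, n_feat_min=2, step=1)
result_none = rfe(X, y, mean_gap, n_feat_min=2, step=None)
print("step=1:   ", result_step1)
print("step=None:", result_none)
```

In this toy case both settings keep features 1 and 3, but ``step=None`` needs one iteration instead of two because the two uninformative features are removed together.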
