To demonstrate the ``TreeModel.fit()`` method, we load the ``DOM_GSEC`` example dataset and its corresponding feature set (see [Breimann24c]_):

In [1]:
import aaanalysis as aa
aa.options["verbose"] = False  # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(15)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create a ``TreeModel`` object and fit it. The importance of each feature and its standard deviation are then available through the ``feat_importance`` and ``feat_importance_std`` attributes:

In [2]:
tm = aa.TreeModel()
tm.fit(X, labels=labels)

feat_importance = tm.feat_importance
feat_importance_std = tm.feat_importance_std

print("Feature importance: ", feat_importance)
print("Their STD: ", feat_importance_std)

Feature importance:  [ 1.366  9.789  6.282  8.597  4.876  0.977  3.32   4.681  6.624  9.47
 20.422  2.162  4.322  9.888  7.225]
Their STD:  [1.682 0.974 0.846 0.619 0.507 1.954 1.69  0.928 0.76  0.472 0.399 1.766
 0.125 0.933 0.297]
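
The importance values printed above sum to roughly 100, which suggests that they are reported as percentages. The following minimal sketch uses only the values copied from that output to rank the features and pair each importance with its standard deviation. It assumes that the positions follow the column order of ``X`` (and thus the row order of ``df_feat``); the variable names ``ranking`` and ``top3`` are illustrative:

```python
# Values copied from the output of cell In [2] above
feat_importance = [1.366, 9.789, 6.282, 8.597, 4.876, 0.977, 3.32, 4.681,
                   6.624, 9.47, 20.422, 2.162, 4.322, 9.888, 7.225]
feat_importance_std = [1.682, 0.974, 0.846, 0.619, 0.507, 1.954, 1.69, 0.928,
                       0.76, 0.472, 0.399, 1.766, 0.125, 0.933, 0.297]

# Sort feature positions (0-based) from most to least important
ranking = sorted(range(len(feat_importance)),
                 key=lambda i: feat_importance[i], reverse=True)

# Report the three most important features with their standard deviation
top3 = ranking[:3]
for i in top3:
    print(f"Feature {i}: {feat_importance[i]:.3f} +/- {feat_importance_std[i]:.3f}")

# The importances add up to about 100 (up to rounding)
total = sum(feat_importance)
print(f"Sum of importances: {total:.1f}")
```

Here the feature at position 10 clearly dominates, followed by positions 13 and 1.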


To obtain Monte Carlo estimates of the feature importance, the ``TreeModel.fit()`` method performs 5 rounds of model fitting by default and averages the feature importance across all rounds. The number of rounds can be adjusted using the ``n_rounds`` parameter; with a single round, there is nothing to average over and the reported standard deviation is 0:

In [3]:
tm = aa.TreeModel()
tm.fit(X, labels=labels, n_rounds=1)

feat_importance = tm.feat_importance
feat_importance_std = tm.feat_importance_std

print("Feature importance: ", feat_importance)
print("Their STD: ", feat_importance_std)

Feature importance:  [ 4.671 11.231  7.857  9.104  6.163  0.     0.     0.     6.779 10.903
 23.284  0.     0.    11.213  8.796]
Their STD:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
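
The averaging across rounds can be illustrated with a small standalone sketch. The per-round values below are hypothetical, and the use of the population standard deviation is an assumption that is merely consistent with the zeros printed above (a sample standard deviation would be undefined for one round); it is not taken from the library's implementation:

```python
from statistics import fmean, pstdev

# Hypothetical importances of 3 features from 5 independent fitting rounds
rounds = [
    [10.0, 60.0, 30.0],
    [12.0, 58.0, 30.0],
    [8.0, 62.0, 30.0],
    [11.0, 59.0, 30.0],
    [9.0, 61.0, 30.0],
]

def summarize(per_round):
    """Return per-feature mean and population standard deviation across rounds."""
    columns = list(zip(*per_round))  # transpose: one tuple of values per feature
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds

mean_5, std_5 = summarize(rounds)      # all five rounds
mean_1, std_1 = summarize(rounds[:1])  # a single round only

print("5 rounds:", mean_5, std_5)
print("1 round: ", mean_1, std_1)
```

With five rounds, features whose importance varies between rounds get a non-zero standard deviation; with one round, every standard deviation is 0.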


``TreeModel.fit()`` also applies a recursive feature elimination (RFE) algorithm, which can be disabled by setting ``use_rfe=False``:

In [4]:
tm.fit(X, labels=labels, use_rfe=False)
feat_importance = tm.feat_importance

print("Feature importance: ", feat_importance)

Feature importance:  [ 3.295  8.818  5.499  8.106  4.547  3.774  3.977  3.885  5.45   9.482
 19.565  3.439  3.523  9.369  7.269]


The number of features selected per round is controlled by the ``n_feat_min`` and ``n_feat_max`` parameters:

In [5]:
tm.fit(X, labels=labels, n_feat_min=1, n_feat_max=3)
feat_importance = tm.feat_importance

print("Feature importance: ", feat_importance)

Feature importance:  [ 0.     0.     0.     0.     0.     0.     0.     0.     0.    15.102
 69.986  0.     0.    14.913  0.   ]
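
As a quick check of the output above, one can count the features with non-zero importance. Reading a zero importance as "never selected" is an assumption. Because importances are averaged across rounds, more than ``n_feat_max`` positions could in principle be non-zero if different rounds selected different features; in this output, exactly three are:

```python
# Values copied from the output of cell In [5] above
feat_importance = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.102,
                   69.986, 0.0, 0.0, 14.913, 0.0]
n_feat_min, n_feat_max = 1, 3

# Positions (0-based) of features that received any importance
selected = [i for i, value in enumerate(feat_importance) if value > 0]

print("Selected feature positions:", selected)
print("Within bounds:", n_feat_min <= len(selected) <= n_feat_max)
```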


The performance measure used for evaluation during each RFE iteration is set by the ``metric`` parameter (default: ``accuracy``):

In [6]:
tm.fit(X, labels=labels, metric="recall")
feat_importance = tm.feat_importance

print("Feature importance: ", feat_importance)

Feature importance:  [ 2.275 10.089  6.208  8.666  5.225  0.     2.57   4.976  6.653  9.748
 21.221  1.403  3.528 10.155  7.282]
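
Why the choice of metric can matter is easiest to see on imbalanced data. The plain-Python sketch below uses hypothetical labels and predictions; the helper functions ``accuracy`` and ``recall`` are written here for illustration and are not part of ``aaanalysis``:

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of positive samples that are predicted as positive."""
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos

# Hypothetical data: 2 positives, 8 negatives; only one positive is found
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

The same predictions score 0.90 on accuracy but only 0.50 on recall, so a feature subset that looks good under one metric may look weaker under the other.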


The number of features eliminated in each iteration is controlled by the ``step`` parameter (default=1). Setting it to ``None`` removes all features with the lowest importance in each iteration, which is faster but less precise:

In [7]:
tm.fit(X, labels=labels, step=None)
feat_importance = tm.feat_importance

print("Feature importance: ", feat_importance)

Feature importance:  [ 1.533  9.588  6.41   9.032  5.193  0.     3.161  4.399  6.22  10.262
 20.224  2.937  3.443  9.966  7.632]
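
The difference between the two settings can be sketched for a single elimination iteration. This is one reading of the description above, not the library's code: the function ``drop_features`` and the importance values are hypothetical, and "lowest importance" is interpreted as all features tied at the minimum value:

```python
def drop_features(importances, step=1):
    """Return the positions kept after one elimination iteration (illustrative)."""
    # Feature positions ordered from least to most important
    order = sorted(range(len(importances)), key=lambda i: importances[i])
    if step is None:
        # Remove every feature tied at the minimum importance
        lowest = importances[order[0]]
        removed = {i for i in order if importances[i] == lowest}
    else:
        # Remove only the `step` least important features
        removed = set(order[:step])
    return [i for i in range(len(importances)) if i not in removed]

importances = [0.0, 5.0, 0.0, 3.0, 0.0, 9.0]  # hypothetical values

keep_one = drop_features(importances, step=1)
keep_all_lowest = drop_features(importances, step=None)

print("step=1:   ", keep_one)
print("step=None:", keep_all_lowest)
```

With ``step=1`` only one of the three zero-importance features is dropped per iteration, so more iterations (and model fits) are needed than with ``step=None``, which drops all three at once.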
