To demonstrate the ``SequenceFeature().feature_matrix()`` method, we load the ``DOM_GSEC`` example dataset together with its respective features (see [Breimann25a]_):

In [10]:
import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC")
features = df_feat["feature"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)

``features`` and ``df_parts`` must be provided to retrieve the feature matrix:

In [11]:
X = sf.feature_matrix(features=features, df_parts=df_parts)
print(f"n samples: {len(df_parts)}")
print(f"n features: {len(features)}")
# X has a shape of n_samples, n_features
print(f"Shape of X: {X.shape}")

n samples: 126
n features: 150
Shape of X: (126, 150)


If sequences in ``df_parts`` contain gaps, you can enable ``accept_gaps`` so that feature values are computed as the average over the part-split combination, ignoring gaps.

In [12]:
X = sf.feature_matrix(features=features, df_parts=df_parts, accept_gaps=True)
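The idea of averaging while ignoring gaps can be illustrated with a minimal, library-independent sketch. It is not the AAanalysis implementation: the scale values, the ``mean_ignoring_gaps`` helper, the ``-`` gap symbol, and the choice to return ``nan`` for an all-gap segment are all assumptions made for illustration.

```python
# Hypothetical scale values (not taken from AAanalysis), for illustration only
scale = {"A": 0.2, "L": 0.9, "G": 0.1, "K": 0.5}


def mean_ignoring_gaps(segment, scale, gap="-"):
    """Average the scale values of all non-gap residues in a segment."""
    values = [scale[aa] for aa in segment if aa != gap]
    if not values:
        # Assumption: a segment consisting only of gaps has no defined value
        return float("nan")
    return sum(values) / len(values)


# Without gaps, all four residues contribute to the mean
value_no_gap = mean_ignoring_gaps("ALGK", scale)
# With a gap, the mean is taken over the three remaining residues only
value_gap = mean_ignoring_gaps("AL-K", scale)
print(value_no_gap, value_gap)
```

In this sketch the gap position is skipped entirely rather than counted as zero, so it does not pull the average down.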

Multiprocessing can be enabled with the ``n_jobs`` parameter, which is set to the maximum number of cores if ``n_jobs=None``. However, this is only recommended for more than ~1000 features per core because of potential process management overhead.

In [13]:
import time

# Run without multiprocessing
time_start = time.time()
X = sf.feature_matrix(features=features, df_parts=df_parts)
time_no_mp = round(time.time() - time_start, 2)
print(f"Time without multiprocessing: {time_no_mp} seconds")

# Run with multiprocessing
time_start = time.time()
X = sf.feature_matrix(features=features, df_parts=df_parts, n_jobs=None)
time_mp = round(time.time() - time_start, 2)
print(f"Time with multiprocessing: {time_mp} seconds")

Time without multiprocessing: 0.48 seconds
Time with multiprocessing: 5.5 seconds
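The rule of thumb of ~1000 features per core could be turned into a small helper that suggests a value for ``n_jobs``. This is only a sketch: ``suggest_n_jobs`` and its ``min_features_per_core`` argument are hypothetical names that are not part of AAanalysis, and the threshold is simply the approximate figure given above.

```python
import os


def suggest_n_jobs(n_features, n_cores=None, min_features_per_core=1000):
    """Suggest a job count so that each job gets at least `min_features_per_core` features."""
    if n_cores is None:
        # os.cpu_count() may return None, so fall back to a single core
        n_cores = os.cpu_count() or 1
    # Number of jobs that can each be given a full share of features
    n_jobs = min(n_cores, n_features // min_features_per_core)
    # Never go below one job (i.e., no multiprocessing)
    return max(1, n_jobs)


# With 150 features, as in the DOM_GSEC example, a single job is suggested
print(suggest_n_jobs(150, n_cores=8))
# With 3500 features, three jobs each receive at least 1000 features
print(suggest_n_jobs(3500, n_cores=8))
```

Assuming ``n_jobs`` accepts a positive integer, the result could then be passed on as ``sf.feature_matrix(features=features, df_parts=df_parts, n_jobs=suggest_n_jobs(len(features)))``. For the 150 features used here this stays at a single job, in line with the timings above.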
