# Feature-based Time Series Clustering in aeon

Feature-based time series clustering (TSCL) algorithms extract descriptive features that summarise the characteristics of each time series and then cluster the resulting feature vectors. Various transformers can be used to derive features from the raw time series data. Bespoke feature-based TSCL algorithms can be built by combining aeon transformers and sklearn clusterers in a pipeline. aeon currently implements the following feature-based time series clusterers:

1. `Catch22Clusterer`
2. `TSFreshClusterer`
3. `SummaryClusterer`
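
As a rough illustration of the pipeline idea described above, the sketch below chains a feature-extraction step with a scikit-learn clusterer. To stay self-contained it uses synthetic data and a hypothetical hand-written function, `simple_features`, in place of an aeon transformer; in practice an aeon feature transformer would take the place of the `FunctionTransformer` step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def simple_features(X):
    """Map a collection of shape (n_cases, n_channels, n_timepoints)
    to a 2D feature matrix of shape (n_cases, 3 * n_channels).

    Hypothetical stand-in for an aeon feature transformer.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=2)
    std = X.std(axis=2)
    # Mean absolute successive difference: a crude roughness measure.
    roughness = np.abs(np.diff(X, axis=2)).mean(axis=2)
    return np.hstack([mean, std, roughness])


# Synthetic data: 20 smooth sine waves and 20 white-noise series, one channel each.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 100)
smooth = np.sin(t) + 0.05 * rng.standard_normal((20, 1, 100))
noisy = rng.standard_normal((20, 1, 100))
X_demo = np.concatenate([smooth, noisy])

# Feature extraction, scaling, then clustering on the feature vectors.
pipe = make_pipeline(
    FunctionTransformer(simple_features),
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X_demo)
print(labels)
```

The scaling step is a design choice rather than a requirement: k-means relies on Euclidean distance, so features on very different scales would otherwise dominate the clustering.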

In [1]:
# Imports and load data
from sklearn.cluster import KMeans

from aeon.clustering import TimeSeriesKMeans
from aeon.clustering.feature_based import (
    Catch22Clusterer,
    SummaryClusterer,
    TSFreshClusterer,
)
from aeon.datasets import load_basic_motions

X_train, y_train = load_basic_motions()
X_test, y_test = load_basic_motions(split="test")

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(80, 6, 100) (80,) (40, 6, 100) (40,)


## 1. Catch22Clusterer

The `Catch22Clusterer` transforms the data with the `Catch22` transformer and then fits a sklearn estimator on the transformed data. `Catch22` extracts 22 CAnonical Time-series CHaracteristics from each channel of a `d`-dimensional time series. The 22 features were selected from a filtered set of 4791 features in the *hctsa* feature library [2]. They form a diverse and interpretable set of time series features, covering linear and non-linear autocorrelation, successive differences, and other properties.
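
To give a feel for the kinds of features mentioned above, here is a minimal NumPy sketch of two simple descriptors: one based on autocorrelation and one based on successive differences. These are simplified illustrations written for this example, not the catch22 reference implementations, and the function names are hypothetical.

```python
import numpy as np


def autocorrelation(x, lag):
    """Sample autocorrelation of a 1D series at a positive lag."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    denom = np.dot(centred, centred)
    if denom == 0.0:
        return 0.0  # constant series: treat as uncorrelated
    return float(np.dot(centred[:-lag], centred[lag:]) / denom)


def first_one_over_e_crossing(x):
    """Smallest lag at which the autocorrelation drops below 1/e."""
    threshold = 1.0 / np.e
    for lag in range(1, len(x)):
        if autocorrelation(x, lag) < threshold:
            return lag
    return len(x)


def mean_abs_successive_difference(x):
    """Average absolute change between consecutive time points."""
    x = np.asarray(x, dtype=float)
    return float(np.abs(np.diff(x)).mean())


# A slowly varying sine wave versus white noise.
rng = np.random.default_rng(0)
sine = np.sin(2 * np.pi * np.arange(200) / 50)
noise = rng.standard_normal(200)

for name, series in [("sine", sine), ("noise", noise)]:
    print(
        name,
        first_one_over_e_crossing(series),
        round(mean_abs_successive_difference(series), 3),
    )
```

The smooth series stays correlated with itself over many lags and changes little between time points, while the noise decorrelates almost immediately. Differences of this kind are what allow a clusterer to separate series in feature space.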

In [2]:
catch = Catch22Clusterer(estimator=KMeans(n_clusters=4))
catch.fit(X_train)

In [3]:
preds = catch.predict(X_train)
preds

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 3, 3, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 3, 1, 2, 3, 3, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 3, 0, 3, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 3, 1, 2, 2, 3, 2, 3, 2, 3, 3])

## 2. TSFreshClusterer

The `TSFreshClusterer` transforms the data with the `TSFresh` transformer and fits a sklearn estimator on the transformed data. The `TSFresh` transformer computes 794 time series features, following the FeatuRe Extraction based on Scalable Hypothesis tests (FRESH) algorithm [3], which automates feature extraction and selection. The algorithm is efficient and scales linearly with the number of features.

In [4]:
tsfresh = TSFreshClusterer(estimator=KMeans(n_clusters=4))
tsfresh.fit(X_train)

In [5]:
preds = tsfresh.predict(X_train)
preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

## 3. SummaryClusterer

Like the algorithms above, the `SummaryClusterer` transforms the input data, here with the `SevenNumberSummary` transformer, and fits an estimator on the transformed data.

The estimator is configurable: we can pass any sklearn clusterer or one of aeon's partition-based clusterers, such as `TimeSeriesKMeans` in the cell below.
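
The transformation step can be sketched in plain NumPy. The version below assumes the seven statistics are the mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles, computed separately for each channel. Both that choice of statistics and the column ordering are assumptions made for illustration and may differ from what `SevenNumberSummary` actually produces.

```python
import numpy as np


def seven_number_summary(X):
    """Summarise each channel of each series with seven statistics.

    X has shape (n_cases, n_channels, n_timepoints); the result has
    shape (n_cases, 7 * n_channels). Illustrative sketch only.
    """
    X = np.asarray(X, dtype=float)
    stats = [
        X.mean(axis=2),
        X.std(axis=2),
        X.min(axis=2),
        X.max(axis=2),
        np.percentile(X, 25, axis=2),
        np.percentile(X, 50, axis=2),
        np.percentile(X, 75, axis=2),
    ]
    # Columns are grouped by statistic: all means, then all stds, and so on.
    return np.concatenate(stats, axis=1)


# Two cases, three channels, five time points.
X_demo = np.arange(2 * 3 * 5, dtype=float).reshape(2, 3, 5)
features = seven_number_summary(X_demo)
print(features.shape)  # (2, 21)
```

The resulting 2D feature matrix is what the chosen estimator is then fitted on.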

In [6]:
summaryclst = SummaryClusterer(estimator=TimeSeriesKMeans(n_clusters=4))
summaryclst.fit(X_train)

In [7]:
preds = summaryclst.predict(X_test)
preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int64)
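
Cluster ids are arbitrary, so predictions like those above cannot be compared to class labels by simple equality. One common option, offered here as a suggestion rather than as part of the aeon workflow, is scikit-learn's `adjusted_rand_score`, which is invariant to how clusters are numbered. The toy labels below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth classes and two candidate clusterings.
true_labels = ["walk", "walk", "walk", "run", "run", "run"]
perfect_clusters = [1, 1, 1, 0, 0, 0]  # same grouping, different ids
imperfect_clusters = [1, 1, 0, 0, 0, 0]  # one series in the wrong group

perfect_score = adjusted_rand_score(true_labels, perfect_clusters)
imperfect_score = adjusted_rand_score(true_labels, imperfect_clusters)
print(perfect_score, imperfect_score)
```

With the variables from this notebook, the equivalent call would be `adjusted_rand_score(y_test, preds)`.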

## References

[1] Christopher Holder, Matthew Middlehurst, and Anthony Bagnall. A Review and Evaluation of Elastic Distance Functions for Time Series Clustering. Knowledge and Information Systems. In Press (2023)

[2] Lubba, Carl H., et al. “catch22: Canonical time-series characteristics.” Data Mining and Knowledge Discovery 33.6 (2019): 1821-1852. https://link.springer.com/article/10.1007/s10618-019-00647-x

[3] Christ, Maximilian, et al. “Time series feature extraction on basis of scalable hypothesis tests (tsfresh–a python package).” Neurocomputing 307 (2018): 72-77. https://www.sciencedirect.com/science/article/pii/S0925231218304843

[4] John Paparrizos, Fan Yang, and Haojun Li. Bridging the Gap: A Decade Review of Time-Series Clustering Methods. arXiv preprint arXiv:2412.20582 (2024). https://arxiv.org/html/2412.20582v1