# Using `aeon` distances with `scikit-learn`

`scikit-learn` has a range of algorithms based on distances, including classifiers,
regressors and clusterers. These can generally all be used with `aeon` distances
in two ways:

1. Pass the distance function as a callable to the `metric` or `kernel` parameter
in the constructor or
2. Set the `metric` or `kernel` to precomputed in the constructor and pass a
pairwise distance matrix to `fit` and `predict`.

## K-Nearest Neighbour Univariate Classification in sklearn.neighbors

In [1]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer

from aeon.datasets import load_gunpoint

# Load the gunpoint dataset as a 2D numpy array
train_x_2D, train_y_2D = load_gunpoint(split="TRAIN", return_type="numpy2D")
test_x_2D, test_y_2D = load_gunpoint(split="TEST", return_type="numpy2D")

# Load the gunpoint dataset as a 3D numpy array
train_x_3D, train_y_3D = load_gunpoint(split="TRAIN")
test_x_3D, test_y_3D = load_gunpoint(split="TEST")

If we have a univariate problem stored as a 2D numpy shape
`(n_cases,n_timepoints)`, we can apply these estimators directly,
but it is treating the data as vector rather than as time series.

If we try and use with an `aeon` style 3D numpy
`(n_cases, n_channels, n_timepoints)`, they will crash as `scikit-learn` expect a 2D 
numpy array. See the [data_formats](../datasets/data_structures.ipynb) for details on 
data storage.

In [30]:
# Using the 2D array format
print(f"Training set shape = {train_x_3D.shape} -> this works with sklearn")

# Apply a sklearn kNN classifier on the 2D time series data using a standard distance
knn = KNeighborsClassifier(metric="manhattan")
knn.fit(train_x_2D, train_y_2D)
predictions_2D = knn.predict(test_x_2D[:5])
print(f"kNN with manhattan distance on 2D time series data {predictions_2D}\n")


# Now using the 3D array format
print(f"Training set shape = {train_x_3D.shape} -> sklearn will crash as is a 3D array")

# Apply a sklearn kNN classifier on the 3D time series data using a standard distance
# This will raise a ValueError as sklearn does not support 3D arrays
try:
    knn.fit(train_x_3D, train_y_3D)
except ValueError as e:
    print(f"Raises this ValueError:\n\t{e}")

Training set shape = (50, 1, 150) -> this works with sklearn
kNN with manhattan distance on 2D time series data ['1' '2' '2' '1' '1']

Training set shape = (50, 1, 150) -> sklearn will crash as is a 3D array
Raises this ValueError:
	Found array with dim 3. KNeighborsClassifier expected <= 2.


We can use `KNeighborsClassifier` with a callable `aeon` distance function, but the 
input must still be 2D numpy array. 

In [29]:
from aeon.distances import dtw_distance, edr_distance, msm_distance, twe_distance

# Apply a sklearn kNN classifier on the 2D time series data using the DTW distance
knn = KNeighborsClassifier(metric=dtw_distance)
knn.fit(train_x_2D, train_y_2D)
predictions_2D_DTW = knn.predict(test_x_2D[:5])

print(f"kNN with DTW distance on 2D time series data {predictions_2D_DTW}\n")


# Apply a sklearn kNN classifier on the 3D time series data using the DTW distance
# This will still raise a ValueError as sklearn does not support 3D arrays
print("kNN with DTW distance on 3D time series data...")
try:
    knn.fit(train_x_3D, train_y_3D)
except ValueError as e:
    print(f"...raises this ValueError:\n\t{e}")

kNN with DTW distance on 2D time series data ['1' '2' '2' '1' '2']

kNN with DTW distance on 3D time series data...
...raises this ValueError:
	Found array with dim 3. KNeighborsClassifier expected <= 2.


We can also use `aeon` distance functions as callables in other `sklearn.neighbors`
estimators:

In [31]:
# Transform X into a graph of k nearest neighbors on the 2D time series data using the
# EDR distance
kt = KNeighborsTransformer(
    metric=edr_distance,
    metric_params={"itakura_max_slope": 0.5},
)

kt.fit(train_x_2D, train_y_2D)
kgraph = kt.transform(test_x_2D[:1]).toarray()  # Convert the sparse matrix to an array

print(
    "Graph of neighbors for the first pattern in testing set with EDR distance on 2D"
    f"time series data:\n{kgraph}\nNote that [i,j] has the weight of edge that "
    "connects i to j.\n"
)

# Again, using a 3D array will raise a ValueError
print("Again, transforming 3D time series data into a graph of neighbors...")
try:
    kt.fit(train_x_3D, train_y_3D)
except ValueError as e:
    print(f"...raises this ValueError:\n\t{e}")

Graph of neighbors for the first pattern in testing set with EDR distance on 2Dtime series data:
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.00666667 0.00666667
  0.00666667 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]]
Note that [i,j] has the weight of edge that connects i to j.

Again, transforming 3D time series data into a graph of neighbors...
...raises this ValueError:
	Found array with dim 3. KNeighborsTransformer expected <= 2.


Also note that using an `aeon` distance function as callable does not will not work with 
some kNN options such as [`KDTree`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html) 
class or [`BallTree`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html),
as stated in the scikit-learn doc of these classes:

_Note: Callable functions in the metric parameter are NOT supported for KDTree_
_and Ball Tree. Function call overhead will result in very poor performance._

Because of these problems, we have implemented a KNN classifier and regressor to use 
with our distance functions.

The `aeon` kNN classifier using a 3D numpy array achieves the same performance than the 
`sklearn` one using the 2D numpy array even using time series specific distance 
functions. The results achieved are the same as time series are univariate and hence,
the data can be formatted as a 2D array:

In [68]:
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier

# Apply the aeon kNN classifier on the 3D time series data using the MSM distance
knn_aeon = KNeighborsTimeSeriesClassifier(distance="msm")
knn_aeon.fit(train_x_3D, train_y_3D)

predictions_3D_aeon = knn_aeon.predict(test_x_3D[:5])

print(f"aeon kNN with MSM distance on 3D time series data {predictions_3D_aeon}")

# Apply a sklearn kNN classifier on the 2D time series data using the MSM distance
knn = KNeighborsClassifier(metric=msm_distance)
knn.fit(train_x_2D, train_y_2D)
predictions_2D_sklearn = knn.predict(test_x_2D[:5])

print(f"sklearn kNN with MSM distance on 2D time series data {predictions_2D_sklearn}")

aeon kNN with MSM distance on 3D time series data ['1' '2' '2' '1' '1']
sklearn kNN with MSM distance on 2D time series data ['1' '2' '2' '1' '1']


## K-Nearest Neighbour Multivariate Classification in sklearn.neighbors

However, if the time series dataset is a multivariate one, data has to be represented 
using a 3D numpy array. In this case, to use the `sklearn` knn approach, channels have
to be concatenated, and therefore, specific edit time series distances may compute the
distance between values of different channels, and the results may be biased by these
misleading implementation:

In [69]:
from aeon.datasets import load_basic_motions

# Load the basic_motions multivariate (MTSC) dataset as a 3D numpy array
train_x_3D_mtsc, train_y_mtsc = load_basic_motions(split="TRAIN")
test_x_3D_mtsc, test_y_mtsc = load_basic_motions(split="TEST")

print(f"3D training set shape = {train_x_3D_mtsc.shape} -> does not work with sklearn")

# Transform the 3D numpy array to 2D concatenating the time series
# This time, the loader does not return the dataset as a 2D array as this is not an
# intended way of working with time series. We need to reshape it ourselves.
train_x_2D_mtsc = train_x_3D_mtsc.reshape(train_x_3D_mtsc.shape[0], -1)
test_x_2D_mtsc = test_x_3D_mtsc.reshape(test_x_3D_mtsc.shape[0], -1)

# shuffle test datasets
from sklearn.utils import shuffle

train_x_3D_mtsc, test_y_mtsc = shuffle(train_x_3D_mtsc, test_y_mtsc, random_state=0)
test_x_2D_mtsc, test_y_mtsc = shuffle(test_x_2D_mtsc, test_y_mtsc, random_state=0)

print(f"2D Training set shape = {train_x_2D_mtsc.shape} -> works with sklearn")

3D training set shape = (40, 6, 100) -> does not work with sklearn
2D Training set shape = (40, 600) -> works with sklearn


In [70]:
# Apply the aeon kNN classifier on the 3D MTSC time series data using the MSM distance
knn_aeon = KNeighborsTimeSeriesClassifier(distance="msm")
knn_aeon.fit(train_x_3D_mtsc, train_y_mtsc)

predictions_3D_aeon = knn_aeon.predict(test_x_3D_mtsc[:5])

print(f"aeon kNN with MSM distance on 3D MTSC time series data {predictions_3D_aeon}")

# Apply a sklearn kNN classifier on the concatenated 2D MTSC time series data using the
# MSM distance
knn = KNeighborsClassifier(metric=msm_distance)
knn.fit(train_x_2D_mtsc, train_y_mtsc)
predictions_2D_sk = knn.predict(test_x_2D_mtsc[:5])

print(f"sklearn kNN with MSM distance on 2D MTSC time series data {predictions_2D_sk}")

aeon kNN with MSM distance on 3D MTSC time series data ['walking' 'standing' 'walking' 'standing' 'walking']
sklearn kNN with MSM distance on 2D MTSC time series data ['walking' 'walking' 'walking' 'standing' 'running']


## K-Nearest Neighbour Univariate Regression in sklearn.neighbors

Similar conclusions can be drawn for the kNN regressor. First of all, we load the 
TSER dataset.

In [76]:
from sklearn.neighbors import KNeighborsRegressor

from aeon.datasets import load_covid_3month
from aeon.regression.distance_based import KNeighborsTimeSeriesRegressor

# Load the Covid3Month dataset as a 3D numpy array
train_x_3D_reg, train_y_3D_reg = load_covid_3month(split="train")
test_x_3D_reg, test_y_3D_reg = load_covid_3month(split="test")

# Load the Covid3Month dataset as a 2D numpy array
train_x_2D_reg, train_y_2D_reg = load_covid_3month(split="train", return_type="numpy2D")
test_x_2D_reg, test_y_2D_reg = load_covid_3month(split="test", return_type="numpy2D")

Now, we compare the prediction of the `aeon` and `scikit-learn` versions. As the 
Covid3Month dataset is univariate, the results of both libraries should be the same.

With respect to multivariate TSER datasets, same conclusions are obtained. **We do not
recommend concatenating channels as a regular practice.**

In [78]:
knn_aeon_reg = KNeighborsTimeSeriesRegressor(distance="twe", n_neighbors=1)
knn_aeon_reg.fit(train_x_3D_reg, train_y_3D_reg)

predictions_3D_reg_aeon = knn_aeon_reg.predict(test_x_3D_reg[:5])

print(
    f"aeon kNN with MSM distance on 3D TSER time series data {predictions_3D_reg_aeon}"
)

knn_sklearn = KNeighborsRegressor(metric=twe_distance, n_neighbors=1)
knn_sklearn.fit(train_x_2D_reg, train_y_2D_reg)

predictions_2D_reg_sk = knn_aeon_reg.predict(test_x_2D_reg[:5])

print(
    f"sklearn kNN with MSM distance on 2D TSER time series data {predictions_2D_reg_sk}"
)

aeon kNN with MSM distance on 3D TSER time series data [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]
sklearn kNN with MSM distance on 2D TSER time series data [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]


## K-Nearest Neighbour Classification and Regression in sklearn.neighbors with precomputed distances

Another alternative is to pass the distance measures as precomputed.
Note that this requires the calculation of the full distance matrices,
and still cannot be used with some other `scikit-learn` knn options.

In [93]:
from sklearn.metrics import accuracy_score

from aeon.distances import adtw_pairwise_distance

# Compute the distances between all pairs of time series in the training set
# and between the testing set and the training set for the testing set
train_dists = adtw_pairwise_distance(train_x_3D)
test_dists = adtw_pairwise_distance(test_x_3D, train_x_3D)

# scikit-learn KNN classifier with precomputed distances
knn = KNeighborsClassifier(metric="precomputed", n_neighbors=1)
knn.fit(train_dists, train_y_3D)
predictions_precomputed = knn.predict(test_dists)

print(f"sklearn kNN with precomputed ADTW distance {predictions_precomputed[:5]}")

# aeon KNN classifier with ADTW distance (not precomputed)
knn_aeon = KNeighborsTimeSeriesClassifier(distance="adtw", n_neighbors=1)
knn_aeon.fit(train_x_3D, train_y_3D)
predictions_aeon = knn_aeon.predict(test_x_3D)

print(f"aeon kNN with ADTW distance {predictions_aeon[:5]}")

# Compute the CCR on both experiments
CCR_precomputed = accuracy_score(test_y_3D, predictions_precomputed)
CCR_aeon = accuracy_score(test_y_3D, predictions_aeon)

print(f"{CCR_precomputed=}\n{CCR_aeon=}")

sklearn kNN with precomputed ADTW distance ['1' '2' '2' '1' '1']
aeon kNN with ADTW distance ['1' '2' '2' '1' '1']
CCR_precomputed=0.9133333333333333
CCR_aeon=0.9133333333333333


Same conclusions can be obtained when dealing with a TSER dataset. 

In [96]:
from sklearn.metrics import mean_squared_error

from aeon.distances import erp_pairwise_distance

# Compute the distances between all pairs of time series in the training set
# and between the testing set and the training set for the testing set
train_dists = erp_pairwise_distance(train_x_3D_reg)
test_dists = erp_pairwise_distance(test_x_3D_reg, train_x_3D_reg)

# scikit-learn KNN regressor with precomputed distances
knn = KNeighborsRegressor(metric="precomputed", n_neighbors=1)
knn.fit(train_dists, train_y_3D_reg)
predictions_precomputed = knn.predict(test_dists)

print(f"sklearn kNN with precomputed ERP distance {predictions_precomputed[:5]}")

# aeon KNN regressor with ERP distance (not precomputed)
knn_aeon = KNeighborsTimeSeriesRegressor(distance="erp", n_neighbors=1)
knn_aeon.fit(train_x_3D_reg, train_y_3D_reg)
predictions_aeon = knn_aeon.predict(test_x_3D_reg)

print(f"aeon kNN with ERP distance {predictions_aeon[:5]}")

# Compute the CCR on both experiments
MSE_precomputed = mean_squared_error(test_y_3D_reg, predictions_precomputed)
MSE_aeon = mean_squared_error(test_y_3D_reg, predictions_aeon)

print(f"{MSE_precomputed=}\n{MSE_aeon=}")

sklearn kNN with precomputed ERP distance [0.02558824 0.05594406 0.01449275 0.03533314 0.12759489]
aeon kNN with ERP distance [0.02558824 0.05594406 0.01449275 0.03533314 0.12759489]
MSE_precomputed=0.002247674986547397
MSE_aeon=0.002247674986547397


## Support Vector Machine Classification in sklearn.svm

In [97]:
from sklearn.svm import SVC, NuSVC

The SVM estimators in `scikit-learn` can be used with pairwise distance matrices. Please 
note that not all elastic distance functions are kernels, and it is desirable that they 
are for SVM. DTW is not a metric, but MSM and TWE are.

In [99]:
from aeon.distances import msm_pairwise_distance, twe_distance, twe_pairwise_distance

svc = SVC(kernel="precomputed")
nsvc = NuSVC(kernel="precomputed")
train_m1 = twe_pairwise_distance(train_x_2D)
test_m1 = twe_pairwise_distance(test_x_2D, train_x_2D)
svc.fit(train_m1, train_y_2D)
print("SVC with TWE first five predictions= ", svc.predict(test_m1)[:5])
train_m2 = msm_pairwise_distance(train_x_3D)
test_m2 = msm_pairwise_distance(test_x_3D, train_x_3D)
nsvc.fit(train_m2, train_y_3D)
print("NuSVC with MSM first five predictions= ", nsvc.predict(test_m2)[:5])

SVC with TWE first five predictions=  ['1' '2' '1' '2' '2']
NuSVC with MSM first five predictions=  ['1' '2' '2' '1' '2']


## Support Vector Machine Regression in sklearn.svm

In [None]:
from sklearn.svm import SVR, NuSVR

In [100]:
from aeon.distances import dtw_pairwise_distance

svr = SVR(kernel="precomputed")
nsvr = NuSVR(kernel=twe_distance)

train_m3 = dtw_pairwise_distance(train_x_3D_reg)
test_m3 = dtw_pairwise_distance(test_x_3D_reg, train_x_3D_reg)
svr.fit(train_m3, train_y_3D_reg)
print("SVR with DTW first five predictions= ", svr.predict(test_m3)[:5])

SVR with DTW first five predictions=  [0.08823529 0.08823529 0.08823529 0.08823529 0.08823529]


# Clustering with sklearn.cluster

In [102]:
from sklearn.cluster import DBSCAN

Some sklearn clustering algorithms accept callable distances or precomputed distance
matrices, and these can be used with aeon distance functions.

Note that DBSCAN can only make predictions on the train data, so it has no predict
function.


In [103]:
db1 = DBSCAN(eps=2.5)
preds1 = db1.fit_predict(train_x_2D)
print(preds1[:5])
db2 = DBSCAN(metric=msm_distance, eps=2.5)
db3 = DBSCAN(metric="precomputed", eps=2.5)
preds2 = db2.fit_predict(train_x_2D)
print(preds1[:5])
preds3 = db3.fit_predict(train_m2)
print(preds1[:5])

[-1  0  0  0  0]
[-1  0  0  0  0]
[-1  0  0  0  0]


You can use pairwise distance functions within the scikit learn `FunctionTransformer`
 wrapper

In [104]:
from sklearn.preprocessing import FunctionTransformer

from aeon.datasets import load_italy_power_demand
from aeon.distances import msm_distance, msm_pairwise_distance

X, y = load_italy_power_demand(split="TRAIN")
ft = FunctionTransformer(msm_pairwise_distance)
X2 = ft.transform(X)
print(" Shape = ", X2.shape)
d = msm_distance(X[0], X[1])
print(f"These should be the same {d} and {X2[0][1]}")

 Shape =  (67, 67)
These should be the same 7.595223506000001 and 7.595223506000001


This makes it easy to use distances as features in an a scikit-learn pipeline.
Whether it is a good idea to do this is a separate question.


In [105]:
from sklearn.pipeline import Pipeline

X, y = load_italy_power_demand(split="TRAIN")

pipe = Pipeline(steps=[("FunctionTransformer", ft), ("SVM", SVC())])
pipe.fit(X, y)
pipe.predict(X)

array(['1', '1', '2', '2', '1', '2', '2', '1', '1', '2', '2', '1', '1',
       '2', '1', '2', '1', '1', '2', '1', '1', '2', '1', '1', '1', '1',
       '1', '2', '2', '1', '1', '2', '2', '1', '2', '2', '1', '2', '1',
       '2', '1', '1', '2', '2', '1', '2', '2', '2', '2', '1', '1', '2',
       '2', '2', '1', '2', '2', '1', '1', '2', '2', '1', '1', '2', '1',
       '2', '2'], dtype='<U1')