# Preprocessing time series with aeon

Time series data often needs preprocessing before machine learning algorithms are
applied. Transformers in `aeon` can be used to preprocess collections of time
series. This notebook demonstrates two common use cases:

1. Rescaling time series
2. Resizing time series


## Rescaling time series

Different levels of scale and variance can mask discriminative patterns in time
series. This is particularly true for methods that are based on distances. It is
common to rescale time series to have zero mean and unit variance. For example, the
data in the `UnitTest` dataset is a subset of the
[Chinatown dataset](https://timeseriesclassification.com/description.php?Dataset=Chinatown).
These are counts of pedestrians in Chinatown, Melbourne. The time series have
different means and standard deviations:

In [2]:
import numpy as np

from aeon.datasets import load_unit_test

X, y = load_unit_test(split="Train")
np.mean(X, axis=-1)[0:5]

array([[561.875     ],
       [604.95833333],
       [629.16666667],
       [801.45833333],
       [540.75      ]])

In [3]:
np.std(X, axis=-1)[0:5]

array([[428.95224215],
       [483.35481095],
       [514.90052977],
       [629.00847763],
       [389.10059218]])
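
To see why scale matters for distance-based methods, here is a small sketch on synthetic data (not the `UnitTest` series). The helper `znorm` and the series `a`, `b` and `c` are made up for illustration. Series `b` has the same shape as `a` at a much larger scale, while `c` has a different shape at a similar scale.

```python
import numpy as np

# Synthetic series, chosen only for illustration.
t = np.linspace(0, 2 * np.pi, 24)
a = np.sin(t)
b = 500 + 400 * np.sin(t)  # same shape as a, different level and scale
c = np.cos(t)  # different shape from a, similar scale


def znorm(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    return (x - x.mean()) / x.std()


# Euclidean distances before rescaling
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Euclidean distances after rescaling each series
norm_ab = np.linalg.norm(znorm(a) - znorm(b))
norm_ac = np.linalg.norm(znorm(a) - znorm(c))

print(raw_ab > raw_ac)  # on raw values, a looks closer to c than to b
print(norm_ab < norm_ac)  # after rescaling, a is closer to b
```

On the raw values the distance is dominated by the difference in level and scale. After rescaling, only the shape of the series contributes.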

We can rescale the time series in three ways:
1. Normalise: subtract the mean and divide by the standard deviation to make all
series have zero mean and unit variance.

In [4]:
from aeon.transformations.collection import Normalizer

normalizer = Normalizer()
X2 = normalizer.fit_transform(X)
np.round(np.mean(X2, axis=-1)[0:5], 6)

array([[ 0.],
       [-0.],
       [ 0.],
       [-0.],
       [-0.]])

In [5]:
np.round(np.std(X2, axis=-1)[0:5], 6)

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])

2. Re-center: subtract the mean of each series, so that every series has zero mean
but keeps its original variance.

In [6]:
from aeon.transformations.collection import Centerer

c = Centerer()
X3 = c.fit_transform(X)
np.round(np.mean(X3, axis=-1)[0:5], 6)

array([[ 0.],
       [-0.],
       [ 0.],
       [-0.],
       [ 0.]])

3. Min-Max: rescale each series so that its values lie between 0 and 1.

In [7]:
from aeon.transformations.collection import MinMaxScaler

minmax = MinMaxScaler()
X4 = minmax.fit_transform(X)
np.round(np.min(X4, axis=-1)[0:5], 6)

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.]])

In [8]:
np.round(np.max(X4, axis=-1)[0:5], 6)

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])

There is no single best way to rescale, although for counts such as these it is more
common to MinMax scale, so that the data still has some interpretation as proportions.
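
The arithmetic behind the three options can be sketched in plain numpy. This is an illustration of the operations described above on a synthetic array `X_toy`, not the `aeon` implementation. It assumes no series is constant, since a constant series would give a zero denominator.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic collection of shape (n_cases, n_channels, n_timepoints)
X_toy = rng.uniform(0, 1000, size=(5, 1, 24))

# Per-series statistics, kept as a trailing axis so they broadcast
mean = X_toy.mean(axis=-1, keepdims=True)
std = X_toy.std(axis=-1, keepdims=True)
lo = X_toy.min(axis=-1, keepdims=True)
hi = X_toy.max(axis=-1, keepdims=True)

X_norm = (X_toy - mean) / std  # zero mean and unit variance per series
X_centred = X_toy - mean  # zero mean, original spread
X_minmax = (X_toy - lo) / (hi - lo)  # values between 0 and 1 per series

print(np.round(X_norm.mean(axis=-1)[0:2], 6))
print(np.round(X_minmax.max(axis=-1)[0:2], 6))
```

Each statistic is computed along the last axis, so every series is rescaled independently of the others.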

## Resizing time series

Suppose we have a collection of time series with different lengths, i.e. different
numbers of time points. Currently, most of aeon's collection estimators
(classification, clustering or regression) require equal-length time
series. Those that can handle unequal length series are tagged with
`"capability:unequal_length"`.

In [None]:
from aeon.classification.convolution_based import RocketClassifier
from aeon.datasets import load_basic_motions, load_plaid

If you want to use an estimator that cannot internally handle unequal length series,
one option is to convert the series to equal length. This can be done through
padding, truncation or resizing through fitting a function and resampling.

## Unequal or equal length collections of time series

If a collection contains only equal length series, the data is stored in a 3D numpy
array of shape `(n_cases, n_channels, n_timepoints)`. If the series are of unequal
length, the data is stored in a list of 2D numpy arrays, each of shape
`(n_channels, n_timepoints)`:

In [None]:
# Equal length multivariate data
bm_X, bm_y = load_basic_motions()
print(type(bm_X), "\n", bm_X.shape)

In [None]:
# Unequal length univariate data
plaid_X, plaid_y = load_plaid()
print(type(plaid_X), "\n", plaid_X[0].shape, "\n", plaid_X[10].shape)
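
The two storage formats can also be reproduced with synthetic data, which avoids loading a dataset. The names `X_equal`, `X_unequal` and `is_unequal` below are hypothetical, and the check at the end is only one simple way to test for unequal length.

```python
import numpy as np

rng = np.random.default_rng(0)

# Equal length: a single 3D array of shape (n_cases, n_channels, n_timepoints)
X_equal = rng.normal(size=(4, 2, 50))

# Unequal length: a list of 2D arrays, each of shape (n_channels, n_timepoints_i)
lengths = [30, 45, 60]
X_unequal = [rng.normal(size=(1, n)) for n in lengths]

print(type(X_equal), X_equal.shape)
print(type(X_unequal), [x.shape for x in X_unequal])

# A collection is unequal length if the last dimension differs between cases
is_unequal = len({x.shape[-1] for x in X_unequal}) > 1
print(is_unequal)
```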

If time series are unequal length, collection estimators will raise an error if they
do not have the capability to handle this characteristic.


In [None]:
rc = RocketClassifier()
try:
    rc.fit(plaid_X, plaid_y)
except ValueError as e:
    print(f"ValueError: {e}")

In [None]:
series_lengths = [array.shape[1] for array in plaid_X]

# Find the minimum and maximum of the second dimensions
min_length = min(series_lengths)
max_length = max(series_lengths)
print(" Min length = ", min_length, " max length = ", max_length)

## Padding, truncating or resizing

We can pad, truncate or resize. By default, pad adds zeros to make all series the
length of the longest, truncate removes all values beyond the length of the shortest
and resize stretches or shrinks the series.

In [None]:
from aeon.transformations.collection import Padder, Resizer, Truncator

pad = Padder()
truncate = Truncator()
resize = Resizer(length=600)
X2 = pad.fit_transform(plaid_X)
X3 = truncate.fit_transform(plaid_X)
X4 = resize.fit_transform(plaid_X)
print(X2.shape, "\n", X3.shape, "\n", X4.shape)
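
To make the three operations concrete, here is a numpy-only sketch on a tiny synthetic collection. It mirrors the default behaviour described above and is not the `aeon` implementation. The helper `resize` is hypothetical and assumes linear interpolation, which is one reasonable way to stretch or shrink a series.

```python
import numpy as np

# Synthetic unequal length collection: list of (n_channels, n_timepoints_i) arrays
X_toy = [np.arange(1.0, n + 1).reshape(1, n) for n in (3, 5, 4)]

shortest = min(x.shape[1] for x in X_toy)
longest = max(x.shape[1] for x in X_toy)

# Pad: append zeros so every series is as long as the longest
padded = np.stack(
    [np.pad(x, ((0, 0), (0, longest - x.shape[1]))) for x in X_toy]
)

# Truncate: drop all values beyond the length of the shortest
truncated = np.stack([x[:, :shortest] for x in X_toy])


def resize(x, length):
    """Linearly interpolate each channel onto `length` evenly spaced points."""
    old = np.linspace(0, 1, x.shape[1])
    new = np.linspace(0, 1, length)
    return np.stack([np.interp(new, old, channel) for channel in x])


# Resize: stretch or shrink every series to a common length of 6
resized = np.stack([resize(x, 6) for x in X_toy])

print(padded.shape, truncated.shape, resized.shape)
# (3, 1, 5) (3, 1, 3) (3, 1, 6)
```

Padding keeps all observed values but adds artificial ones, truncation discards data from the longer series, and resizing keeps the overall shape while changing the sampling rate.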

You can put these transformers in a pipeline, so that the same transformation is
applied to both the train and test splits:


In [None]:
from sklearn.metrics import accuracy_score

from aeon.pipeline import make_pipeline

# Unequal length univariate data
train_X, train_y = load_plaid(split="Train")
test_X, test_y = load_plaid(split="Test")
# Truncate to equal length, then classify
steps = [truncate, rc]
pipe = make_pipeline(steps)
pipe.fit(train_X, train_y)
preds = pipe.predict(test_X)
# Score the predictions against the test labels, not the train labels
accuracy_score(test_y, preds)