# Storing data to use for aeon estimators

`aeon` supports multiple time series tasks, including forecasting and machine learning
(i.e. classification, regression and clustering). These two communities have different
conventions and requirements for storing data and for naming data structures. We try
to accommodate both, which leads to some differences between `aeon` modules. Some
differences are:

1. Forecasters almost always store data in pandas data structures internally, whereas machine
 learners use numpy arrays almost exclusively.
2. Most forecasting estimators (but not all) will take a single series as a 1D or 2D array-like
 as the data to learn from, whereas machine learning estimators will take a collection of series
 as a 3D or 2D array-like.
3. In forecasting, 2D arrays are almost always a single series of shape `(n_timepoints, n_channels)`,
 whereas in machine learning a 2D array usually stores a collection of series of shape
 `(n_cases, n_timepoints)`.
4. In forecasting, a variable `y` refers to a time series for which we are attempting
 to make a forecast, hence `y` is assumed to be ordered. In machine learning,
 `y` is a list of either class labels (for classification) or observations of a
 response variable (for regression). The ordering of values in `y` is determined by
 the ordering of the `X` input.
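
The shape conventions in points 2 and 3 can be illustrated with plain numpy. This is a minimal sketch rather than `aeon` code, and the variable names are illustrative only.

```python
import numpy as np

# Two series, each with five time points.
a = np.array([[20.0, 40.0, 60.0, 80.0, 100.0], [100.0, 90.0, 80.0, 70.0, 60.0]])

# Machine learning convention: a 2D array is a collection of univariate
# series with shape (n_cases, n_timepoints).
n_cases, n_timepoints = a.shape

# Forecasting convention: a 2D array is a single multivariate series with
# shape (n_timepoints, n_channels), so the same data must be transposed.
series = a.T

print(a.shape)  # (2, 5)
print(series.shape)  # (5, 2)
```

The same block of numbers therefore means different things depending on which module receives it, which is why checking the expected shape before calling `fit` is worthwhile.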

Because of these sources of confusion, we recommend carefully reading the documentation for the task
prior to usage to ensure you are using the correct input data type. We also recommend that you store
data in pandas data structures for forecasting and numpy arrays for machine learning tasks. All of
our accepted input types can be used given they are compatible with the algorithms (see the
[data conversions notebook](examples/datasets/data_conversions.ipynb) for more accepted types), but
keeping to the recommended types is likely to reduce the number of data conversions and make finding help
easier.

In the following, we provide guidance and examples for storing data for forecasting and machine learning
using our recommended data types.

## Forecasting data

The `aeon` forecasting module primarily uses `pd.Series`, `pd.DataFrame` and `pd.MultiIndex` to store data.
It has some built-in forecasting datasets and tools for downloading commonly used benchmarks; see the
forecasting section of the [data loading notebook](examples/datasets/data_loading.ipynb). For details of
the forecasting functionality, see the [forecasting user guide](examples/forecasting/forecasting.ipynb)
and the numerous forecasting notebooks on the [examples page](examples).

A `pd.Series` is used to store a univariate time series, with entries corresponding to
different time points.

In [19]:
# Forecasting data in a pandas.Series
import numpy as np
import pandas as pd

y = pd.Series([20.0, 40.0, 60.0, 80.0, 100.0])
y

0     20.0
1     40.0
2     60.0
3     80.0
4    100.0
dtype: float64

In [20]:
from aeon.forecasting.trend import TrendForecaster

tf = TrendForecaster()
tf.fit(y)  # fit the forecaster
tf.predict(fh=[1, 2, 3])  # forecast the next 3 values

5    120.0
6    140.0
7    160.0
dtype: float64

A `pd.DataFrame` is used to store multiple time series, where each column is a time
series and each row corresponds to a distinct time point. The index holds the time
points and should be monotonic. The following example creates two series called Sales
and Temperature, with observations at time points 0, 1, 2, 3, 4 and 5.

In [21]:
ice_creams = {
    "Sales": [111, 100, 90, 80, 65, 89],
    "Temperature": [26, 21, 19, 14, 12, 22],
}
# Create DataFrame
ice_creams = pd.DataFrame(ice_creams)
ice_creams

| | Sales | Temperature |
|---|---|---|
| 0 | 111 | 26 |
| 1 | 100 | 21 |
| 2 | 90 | 19 |
| 3 | 80 | 14 |
| 4 | 65 | 12 |
| 5 | 89 | 22 |


In [22]:
from aeon.forecasting.exp_smoothing import ExponentialSmoothing

es = ExponentialSmoothing()
es.fit(ice_creams)
es.predict(fh=[1, 2, 3])

| | Sales | Temperature |
|---|---|---|
| 6 | 89.0 | 22.0 |
| 7 | 89.0 | 22.0 |
| 8 | 89.0 | 22.0 |


You can add a date-time index, and this is required by some forecasters (e.g. Prophet).

In [23]:
ice_creams["datetime"] = pd.to_datetime(
    [
        "01-06-2018 23:15:00",  # Creating data
        "02-09-2019 01:48:00",
        "08-06-2020 13:20:00",
        "07-03-2021 14:50:00",
        "07-06-2022 11:50:00",
        "03-05-2023 16:50:00",
    ]
)
ice_creams = ice_creams.set_index("datetime")
ice_creams

| datetime | Sales | Temperature |
|---|---|---|
| 2018-01-06 23:15:00 | 111 | 26 |
| 2019-02-09 01:48:00 | 100 | 21 |
| 2020-08-06 13:20:00 | 90 | 19 |
| 2021-07-03 14:50:00 | 80 | 14 |
| 2022-07-06 11:50:00 | 65 | 12 |
| 2023-03-05 16:50:00 | 89 | 22 |
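
Because the index should be monotonic, it can help to check this before fitting a forecaster. The following is a minimal pandas-only sketch; the data and the `sales` name are hypothetical.

```python
import pandas as pd

# Hypothetical daily observations, deliberately out of order.
sales = pd.DataFrame(
    {"Sales": [100, 111, 90]},
    index=pd.to_datetime(["2023-01-02", "2023-01-01", "2023-01-03"]),
)
print(sales.index.is_monotonic_increasing)  # False

# Sorting by the index restores chronological order.
sales = sales.sort_index()
print(sales.index.is_monotonic_increasing)  # True
print(sales["Sales"].tolist())  # [111, 100, 90]
```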


A `pd.DataFrame` can also have a multi-level index (`pd.MultiIndex`), which is used
to represent panel and hierarchical data in forecasting. A panel is a
collection of (possibly multivariate) time series.

In [24]:
from aeon.utils._testing.hierarchical import _make_hierarchical

y = _make_hierarchical()
y

| h0 | h1 | time | c0 |
|---|---|---|---|
| h0_0 | h1_0 | 2000-01-01 | 4.249534 |
| h0_0 | h1_0 | 2000-01-02 | 2.899939 |
| h0_0 | h1_0 | 2000-01-03 | 2.671320 |
| h0_0 | h1_0 | 2000-01-04 | 4.380220 |
| h0_0 | h1_0 | 2000-01-05 | 5.538047 |
| ... | ... | ... | ... |
| h0_1 | h1_3 | 2000-01-08 | 3.658460 |
| h0_1 | h1_3 | 2000-01-09 | 3.672319 |
| h0_1 | h1_3 | 2000-01-10 | 2.938018 |
| h0_1 | h1_3 | 2000-01-11 | 2.902982 |
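
The frame above comes from a private testing helper. A similar multi-indexed structure can be built by hand with `pd.MultiIndex.from_product`. This is a minimal sketch: the level names `instance` and `time` and the values are illustrative assumptions, not a format required by `aeon`.

```python
import numpy as np
import pandas as pd

# Two instances ("a" and "b"), each observed at three daily time points.
index = pd.MultiIndex.from_product(
    [["a", "b"], pd.date_range("2000-01-01", periods=3, freq="D")],
    names=["instance", "time"],
)
panel = pd.DataFrame({"c0": np.arange(6, dtype=float)}, index=index)

print(panel.shape)  # (6, 1)
print(panel.index.nlevels)  # 2

# Select the series for a single instance using the outer index level.
print(panel.loc["a"]["c0"].tolist())  # [0.0, 1.0, 2.0]
```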


In [25]:
es.fit(y, fh=[1, 2]).predict()

| h0 | h1 | time | c0 |
|---|---|---|---|
| h0_0 | h1_0 | 2000-01-13 | 4.200625 |
| h0_0 | h1_0 | 2000-01-14 | 4.200625 |
| h0_0 | h1_1 | 2000-01-13 | 3.7145 |
| h0_0 | h1_1 | 2000-01-14 | 3.7145 |
| h0_0 | h1_2 | 2000-01-13 | 3.982618 |
| h0_0 | h1_2 | 2000-01-14 | 3.982618 |
| h0_0 | h1_3 | 2000-01-13 | 3.911963 |
| h0_0 | h1_3 | 2000-01-14 | 3.911963 |
| h0_1 | h1_0 | 2000-01-13 | 3.627664 |
| h0_1 | h1_0 | 2000-01-14 | 3.627664 |


`np.ndarray` can be used with the forecasters in `aeon`, although we recommend using
pandas. A one-dimensional `np.ndarray` is treated as a single univariate time series. A 2D
numpy array is treated as a single series of shape `(n_timepoints, n_channels)`, so each
column is forecast as a separate series.

In [26]:
y = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
forecaster = TrendForecaster()
forecaster.fit(y)  # fit the forecaster
forecaster.predict(fh=[1, 2, 3])  # forecast the next 3 values

array([[120.],
       [140.],
       [160.]])

In [27]:
y = np.array([[20.0, 40.0, 60.0, 80.0, 100.0], [100.0, 90.0, 80.0, 70.0, 60.0]])
y = y.transpose()
forecaster = TrendForecaster()
forecaster.fit(y)  # fit the forecaster
forecaster.predict(fh=[1, 2, 3])  # forecast the next 3 values

array([[120.,  50.],
       [140.,  40.],
       [160.,  30.]])
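
Since pandas is the recommended container for forecasting, an array in the `(n_timepoints, n_channels)` layout can be wrapped in a `pd.DataFrame` and converted back without changing its shape. This is a minimal sketch; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# (n_timepoints, n_channels) array: three time points, two channels.
arr = np.array([[20.0, 100.0], [40.0, 90.0], [60.0, 80.0]])

# Wrap it in a DataFrame; each column becomes one named series.
df = pd.DataFrame(arr, columns=["up", "down"])

# Going back to numpy preserves the (n_timepoints, n_channels) layout.
back = df.to_numpy()
print(back.shape)  # (3, 2)
print(np.array_equal(arr, back))  # True
```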

## Machine learning data

Machine learning algorithms generally use collections of instances or cases. Like
scikit-learn, pytorch and keras, we primarily store these in numpy arrays.
A collection contains a number of time series cases (or just cases), which we refer
to in code as `n_cases`. Each case contains a number of time series observations,
which we denote `n_timepoints`, and one or more channels, which we denote `n_channels`.
Equal length collections are stored as 3D arrays of shape `(n_cases, n_channels, n_timepoints)`.

In [28]:
X = np.array(
    [
        [[20.0, 40.0, 60.0, 80.0, 100.0]],  # Univariate series as 3D array
        [[100.0, 90.0, 80.0, 70.0, 60.0]],
    ]
)  # n_cases = 2, n_channels = 1, n_timepoints = 5
print("X shape = ", X.shape, " First series =", X[0], "second series = ", X[1])

X shape =  (2, 1, 5)  First series = [[ 20.  40.  60.  80. 100.]] second series =  [[100.  90.  80.  70.  60.]]


In [29]:
X = np.array(
    [
        [[20, 40, 600, 55], [10, 11, 12, 11], [-4, 1, 6.6, 2]],
        [[10, 90, 80, 100], [14, 70, 60, 22], [49, 49, 66, 9]],
        [[14, 6, 10, -401], [44, 70, 60, 22], [49, 52, 33, 49]],
        [[22, 93, 18, 100], [34, 170, 0, 87], [49, 49, 33, 49]],
    ]
)
# n_cases = 4, n_channels = 3, n_timepoints = 4
print("X shape = ", X.shape, "\n First series =\n", X[0], "\nsecond series = \n", X[1])

X shape =  (4, 3, 4) 
 First series =
 [[ 20.   40.  600.   55. ]
 [ 10.   11.   12.   11. ]
 [ -4.    1.    6.6   2. ]] 
second series = 
 [[ 10.  90.  80. 100.]
 [ 14.  70.  60.  22.]
 [ 49.  49.  66.   9.]]
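
Data often arrives as a 2D `(n_cases, n_timepoints)` array or as separate 2D cases. The sketch below, using only numpy and illustrative variable names, shows one way to bring both into the 3D `(n_cases, n_channels, n_timepoints)` layout.

```python
import numpy as np

# A 2D collection of univariate series: (n_cases, n_timepoints).
X2d = np.array([[20.0, 40.0, 60.0], [100.0, 90.0, 80.0]])

# Insert a channel axis to get (n_cases, n_channels, n_timepoints).
X3d = X2d[:, np.newaxis, :]
print(X3d.shape)  # (2, 1, 3)

# Equal length cases of shape (n_channels, n_timepoints) can be stacked.
cases = [np.zeros((3, 4)), np.ones((3, 4))]
X_stacked = np.stack(cases)
print(X_stacked.shape)  # (2, 3, 4)
```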


In [30]:
from aeon.clustering.k_means import TimeSeriesKMeans

kmeans = TimeSeriesKMeans(metric="euclidean", n_clusters=2)
kmeans.fit(X)
kmeans.predict(X)

array([0, 1, 1, 1], dtype=int64)

The target variable for classification should be stored as a `np.ndarray` of integers
or strings, with one label per case.

In [31]:
y = np.array([1, 1, 0, 0])
y2 = np.array(["pass", "pass", "fail", "fail"])
y2

array(['pass', 'pass', 'fail', 'fail'], dtype='<U4')
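
Labels are matched to cases by position, so `y` needs one entry per case in the same order as `X`. The following numpy-only sketch checks this and inspects the classes; the names `X_demo` and `labels` are placeholders, not part of `aeon`.

```python
import numpy as np

X_demo = np.zeros((4, 1, 5))  # placeholder collection with n_cases = 4
labels = np.array(["pass", "pass", "fail", "fail"])

# One label per case, in the same order as the cases in X_demo.
assert len(labels) == X_demo.shape[0]

# np.unique lists the classes (sorted) and can encode them as integers.
classes, encoded = np.unique(labels, return_inverse=True)
print(classes)  # ['fail' 'pass']
print(encoded)  # [1 1 0 0]
```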

In [32]:
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier

knn = KNeighborsTimeSeriesClassifier(distance="dtw")
knn.fit(X, y2)
knn.predict(X)

array(['pass', 'pass', 'fail', 'fail'], dtype='<U4')

For regression, the target variable should be a `np.ndarray` of type float.


In [33]:
y3 = np.array([1.5, 4.3, -2.0, 10])
y3

array([ 1.5,  4.3, -2. , 10. ])

In [34]:
from aeon.regression.distance_based import KNeighborsTimeSeriesRegressor

knn_r = KNeighborsTimeSeriesRegressor(distance="dtw")
knn_r.fit(X, y3)  # fit on the float regression target defined above
knn_r.predict(X)

array([ 1.5,  4.3, -2. , 10. ])

If the time series are not all of equal length, they should be stored as a list of 2D
numpy arrays, each of shape `(n_channels, n_timepoints)`. Some estimators can deal with
unequal length series. Those that cannot will raise an exception if passed unequal length
series. Note that we assume all channels of any given series have the same length.

In [35]:
x0 = np.array([[20, 40, 60, 55, 66], [10, 11, 12, 11, 66], [-4, 15, 6.6, 12, 44]])
x1 = np.array([[10, 90, 80], [70, 60, 22], [49, 66, 9]])
x2 = np.array([[22, 93, 18, 100], [34, 170, 0, 87], [49, 49, 33, 49]])
X_uneq = []
X_uneq.append(x0)
X_uneq.append(x1)
X_uneq.append(x2)

X_uneq

[array([[20. , 40. , 60. , 55. , 66. ],
        [10. , 11. , 12. , 11. , 66. ],
        [-4. , 15. ,  6.6, 12. , 44. ]]),
 array([[10, 90, 80],
        [70, 60, 22],
        [49, 66,  9]]),
 array([[ 22,  93,  18, 100],
        [ 34, 170,   0,  87],
        [ 49,  49,  33,  49]])]

In [36]:
y = np.array([0, 0, 1])
knn.fit(X_uneq, y)
knn.predict(X_uneq)

array([0, 0, 1])
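
For estimators that cannot handle unequal length input, one option is to pad the shorter series to a common length first. The sketch below does this with `np.pad`; `pad_to_longest` is a hypothetical helper written for illustration, and right-padding with zeros is an assumption rather than an `aeon` recommendation.

```python
import numpy as np


def pad_to_longest(collection, fill_value=0.0):
    """Right-pad each (n_channels, n_timepoints) array to a common length."""
    max_len = max(x.shape[1] for x in collection)
    padded = [
        np.pad(
            x.astype(float),
            ((0, 0), (0, max_len - x.shape[1])),  # pad the time axis only
            constant_values=fill_value,
        )
        for x in collection
    ]
    # All cases now share a shape, so they stack into a 3D array.
    return np.stack(padded)


X_list = [np.ones((3, 5)), np.ones((3, 3)), np.ones((3, 4))]
X_padded = pad_to_longest(X_list)
print(X_padded.shape)  # (3, 3, 5)
print(X_padded[1, 0])  # [1. 1. 1. 0. 0.]
```

Padding changes the data the estimator sees, so whether it is appropriate depends on the algorithm and the fill value chosen.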

`aeon` has several standard problems baked in, and facilities for loading data from
external sources. Please see the [provided data notebook](examples/datasets/provided_data.ipynb)
and [data loading notebook](examples/datasets/data_loading.ipynb).