# A performant data container for complex time series

In this notebook, we explore an experimental pandas-based data container for easy and fast manipulation of multi-series, multivariate time series. The goal is a data structure that can hold arbitrarily complex time series together with static, time-independent information (e.g. target labels) and metadata (e.g. measurement frequency or length of the time series).

The approach presented here extends pandas' capabilities to store each time series (i.e. a list of timestamp-value pairs) in a single cell. 

A specification of desired properties of such a data container can be found [in the sktime wiki](https://github.com/alan-turing-institute/sktime/wiki/Time-series-data-container).






# Example 1: Univariate time series

The following code uses a first, rudimentary implementation of a data container to run the [Univariate time series classification](https://alan-turing-institute.github.io/sktime/examples/02_classification_univariate.html) tutorial. Using data from the [GunPoint problem](http://timeseriesclassification.com/description.php?Dataset=GunPoint), it applies random segmentation and feature extraction to the raw data and then uses a single DecisionTree to predict whether an actor draws a gun prop from a holster or just mimes the action.

## Preliminaries

In [1]:
# Standard libraries
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

from sktime.datasets import load_gunpoint
from sktime.utils.time_series import time_series_slope

# Custom implementation of a rudimentary data container for equally spaced time series
from container import TimeFrame
from container.transformers import RandomIntervalFeatureExtractor

## Current data container

At the moment, sktime uses a nested pandas DataFrame to store arbitrary time series data within a column. Each cell in the `dim_0` column is itself a pandas Series, making the entire column a "Series-of-Series". While this is a very flexible approach, it can make the data container awkward to work with, results in suboptimal printing in the console, and can be slow for large datasets since most operations on it require iteration.
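To make the "Series-of-Series" layout concrete, here is a minimal sketch that builds a nested column from synthetic data. The values and sizes are made up for illustration and are not the GunPoint data; only standard pandas and numpy calls are used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a nested column: 4 series with 6 time steps each.
rng = np.random.default_rng(0)
nested = pd.DataFrame({
    "dim_0": [pd.Series(rng.normal(size=6)) for _ in range(4)],
    "class_val": ["1", "2", "2", "1"],
})

# pandas only sees a column of Python objects, so the time axis is invisible.
print(nested["dim_0"].dtype)   # object
print(nested["dim_0"].shape)   # (4,)

# Recovering the common length means visiting every cell.
lengths = nested["dim_0"].apply(len)
print(lengths.unique())        # [6]
```

Because each cell is an independent object, nothing in the column itself records that all series share the same length.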

In [2]:
X_nested = load_gunpoint(return_X_y=False)
# Due to a bug, only sequential DataFrame indices 0, 1, ... are currently supported
X_nested.reset_index(inplace=True, drop=True)
print(X_nested[:3])

                                               dim_0 class_val
0  0     -0.64789
1     -0.64199
2     -0.63819
3...         2
1  0     -0.64443
1     -0.64540
2     -0.64706
3...         2
2  0     -0.77835
1     -0.77828
2     -0.77715
3...         1


Even if each cell contains a time series of equal length (in this case 150 time steps), determining this fact requires calling `len` on each time series.

In [3]:
print(X_nested.dim_0.shape)
print(X_nested.dim_0.values.shape)

(200,)
(200,)


In [4]:
np.all(X_nested.dim_0.apply(len) == 150)

True

In [5]:
print(X_nested.dtypes)

dim_0        object
class_val    object
dtype: object


Subsetting these time series might also behave in unexpected ways, implicitly unnesting the dataset into a tabular format with one column per time step.

In [6]:
X_nested.dim_0.apply(lambda x: x[:10])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.64789,-0.64199,-0.63819,-0.63826,-0.63835,-0.63870,-0.64305,-0.64377,-0.64505,-0.64712
1,-0.64443,-0.64540,-0.64706,-0.64749,-0.64691,-0.64388,-0.63973,-0.63809,-0.63530,-0.63538
2,-0.77835,-0.77828,-0.77715,-0.77768,-0.77590,-0.77242,-0.76546,-0.76228,-0.76375,-0.76536
3,-0.75006,-0.74810,-0.74616,-0.74593,-0.74377,-0.74381,-0.74521,-0.74508,-0.74573,-0.74582
4,-0.59954,-0.59742,-0.59927,-0.59826,-0.59758,-0.59130,-0.58902,-0.58753,-0.58546,-0.58385
...,...,...,...,...,...,...,...,...,...,...
195,-0.58001,-0.58333,-0.58611,-0.58912,-0.59195,-0.59920,-0.60929,-0.61850,-0.62716,-0.63600
196,-0.72815,-0.73024,-0.73356,-0.73419,-0.73433,-0.73466,-0.73386,-0.73340,-0.73293,-0.73182
197,-0.73801,-0.73630,-0.73123,-0.72846,-0.72888,-0.72737,-0.72453,-0.72092,-0.71983,-0.71923
198,-1.26510,-1.25610,-1.25940,-1.25640,-1.25330,-1.26010,-1.26510,-1.25640,-1.24640,-1.24910


In [7]:
X_subset = X_nested.copy()
X_subset['dim_0'] = [x[:10] for x in X_subset['dim_0']]
print(X_subset[:3])
print(f"Maximum time series length {max(X_subset.dim_0.apply(len))}")

                                               dim_0 class_val
0  0   -0.64789
1   -0.64199
2   -0.63819
3   -0....         2
1  0   -0.64443
1   -0.64540
2   -0.64706
3   -0....         2
2  0   -0.77835
1   -0.77828
2   -0.77715
3   -0....         1
Maximum time series length 10
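The unnesting seen with `apply` follows from documented pandas behaviour: when the function passed to `Series.apply` returns a Series, the results are combined into a DataFrame. The sketch below reproduces both outcomes on a small synthetic column (the data is illustrative only).

```python
import pandas as pd

# Synthetic nested column: 3 series with 5 time steps each.
nested = pd.Series(
    [pd.Series([float(i + j) for j in range(5)]) for i in range(3)]
)

# The function returns a Series per cell, so pandas builds a DataFrame.
expanded = nested.apply(lambda s: s.iloc[:2])
print(type(expanded).__name__, expanded.shape)   # DataFrame (3, 2)

# A list comprehension keeps each truncated Series inside its cell.
kept = pd.Series([s.iloc[:2] for s in nested], index=nested.index)
print(type(kept).__name__, kept.shape, len(kept.iloc[0]))   # Series (3,) 2
```

Which of the two results you get therefore depends on how the truncation is written, not on the data.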


## Proposed approach

The proposed approach tries to alleviate these problems by storing all time series of a column as a single array in the background. In the simple implementation shown here, this amounts to two numpy arrays (one for the time index and one for the data).
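The idea of backing a column with two arrays can be sketched as follows. `ArrayBackedColumn` is a hypothetical stand-in written for this illustration and is not the `TimeFrame` API; it also assumes that all series share one time index, which matches the equally spaced case but is an assumption here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(eq=False)
class ArrayBackedColumn:
    """Hypothetical stand-in for a time series column (not the TimeFrame API)."""

    time_index: np.ndarray  # shape (n_timepoints,), shared by all series
    data: np.ndarray        # shape (n_series, n_timepoints)

    def __post_init__(self):
        # Reject inputs whose time axis does not match the shared index.
        if self.data.ndim != 2 or self.data.shape[1] != self.time_index.shape[0]:
            raise ValueError("data must be 2D with one column per time point")

    def __len__(self):
        # Number of series (rows), not number of time points.
        return self.data.shape[0]


rng = np.random.default_rng(0)
col = ArrayBackedColumn(time_index=np.arange(150), data=rng.normal(size=(200, 150)))

print(len(col))                     # 200
print(col.data.shape)               # (200, 150)
print(col.data.mean(axis=1).shape)  # (200,)
```

With this layout, the series length is a property of the array shape, and per-series statistics are single vectorised calls rather than Python loops.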

In [8]:
X = TimeFrame(X_nested, copy=True)
X.head()

Unnamed: 0,dim_0,class_val
0,"[0: -0.64789, 1: -0.64199, ...]",2
1,"[0: -0.64443, 1: -0.6454, ...]",2
2,"[0: -0.77835, 1: -0.77828, ...]",1
3,"[0: -0.75006, 1: -0.7481, ...]",1
4,"[0: -0.59954, 1: -0.59742, ...]",2


In [9]:
X.info()

<class 'container.timeframe.TimeFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype     
---  ------     --------------  -----     
 0   dim_0      200 non-null    timeseries
 1   class_val  200 non-null    object    
dtypes: object(1), timeseries(1)
memory usage: 236.1+ KB


Storing the data in this way means the values live in fast array implementations in the background, removing the need to loop over pandas Series objects. The column also implicitly knows its type (`timeseries`).

In [10]:
X.dim_0.data.shape  # the column knows its 2D shape, unlike above

(200, 150)

In [11]:
print(X.dtypes)

dim_0        timeseries
class_val        object
dtype: object


Using a custom pandas subclass further allows us to implement fast versions of common functions for time series manipulation such as tabularisation and slicing in the time dimension.

In [12]:
# Tabularise the structure, i.e. convert to a 2D pandas.DataFrame
X.tabularise()

Unnamed: 0,dim_0_0,dim_0_1,dim_0_2,dim_0_3,dim_0_4,dim_0_5,dim_0_6,dim_0_7,dim_0_8,dim_0_9,...,dim_0_141,dim_0_142,dim_0_143,dim_0_144,dim_0_145,dim_0_146,dim_0_147,dim_0_148,dim_0_149,class_val
0,-0.64789,-0.64199,-0.63819,-0.63826,-0.63835,-0.63870,-0.64305,-0.64377,-0.64505,-0.64712,...,-0.63972,-0.63973,-0.64018,-0.63923,-0.63939,-0.64023,-0.64043,-0.63867,-0.63866,2
1,-0.64443,-0.64540,-0.64706,-0.64749,-0.64691,-0.64388,-0.63973,-0.63809,-0.63530,-0.63538,...,-0.64143,-0.63927,-0.63780,-0.63768,-0.63526,-0.63549,-0.63493,-0.63450,-0.63160,2
2,-0.77835,-0.77828,-0.77715,-0.77768,-0.77590,-0.77242,-0.76546,-0.76228,-0.76375,-0.76536,...,-0.71871,-0.71353,-0.71002,-0.70413,-0.70326,-0.70339,-0.70420,-0.70761,-0.70712,1
3,-0.75006,-0.74810,-0.74616,-0.74593,-0.74377,-0.74381,-0.74521,-0.74508,-0.74573,-0.74582,...,-0.72466,-0.72923,-0.72894,-0.72783,-0.72824,-0.72645,-0.72552,-0.72519,-0.72468,1
4,-0.59954,-0.59742,-0.59927,-0.59826,-0.59758,-0.59130,-0.58902,-0.58753,-0.58546,-0.58385,...,-0.64388,-0.64574,-0.64646,-0.64646,-0.64558,-0.64241,-0.64334,-0.63680,-0.63172,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,-0.58001,-0.58333,-0.58611,-0.58912,-0.59195,-0.59920,-0.60929,-0.61850,-0.62716,-0.63600,...,-0.53771,-0.53755,-0.53832,-0.53892,-0.54154,-0.54584,-0.54662,-0.54883,-0.55355,2
196,-0.72815,-0.73024,-0.73356,-0.73419,-0.73433,-0.73466,-0.73386,-0.73340,-0.73293,-0.73182,...,-0.76871,-0.74816,-0.72856,-0.71113,-0.69960,-0.68958,-0.68748,-0.68645,-0.69018,1
197,-0.73801,-0.73630,-0.73123,-0.72846,-0.72888,-0.72737,-0.72453,-0.72092,-0.71983,-0.71923,...,-0.61386,-0.61159,-0.60978,-0.60981,-0.60885,-0.61002,-0.60965,-0.60862,-0.61218,2
198,-1.26510,-1.25610,-1.25940,-1.25640,-1.25330,-1.26010,-1.26510,-1.25640,-1.24640,-1.24910,...,-1.19330,-1.19570,-1.18960,-1.17710,-1.18800,-1.18960,-1.20000,-1.19340,-1.19280,2


In [13]:
# Segment across the time axis
X.dim_0.slice_time(np.arange(5, 10))

0       [5: -0.6387, 6: -0.64305, ...]
1      [5: -0.64388, 6: -0.63973, ...]
2      [5: -0.77242, 6: -0.76546, ...]
3      [5: -0.74381, 6: -0.74521, ...]
4       [5: -0.5913, 6: -0.58902, ...]
                    ...               
195     [5: -0.5992, 6: -0.60929, ...]
196    [5: -0.73466, 6: -0.73386, ...]
197    [5: -0.72737, 6: -0.72453, ...]
198      [5: -1.2601, 6: -1.2651, ...]
199      [5: -1.2644, 6: -1.2715, ...]
Name: dim_0, Length: 200, dtype: timeseries
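Given a 2D array layout, both operations reduce to plain array manipulation. The sketch below shows one way this could look; `tabularise_sketch` and `slice_time_sketch` are hypothetical helper names, and the container's actual `tabularise` and `slice_time` methods may be implemented differently.

```python
import numpy as np
import pandas as pd


def tabularise_sketch(data, name):
    """One column per time step, named '<name>_<position>'."""
    columns = [f"{name}_{i}" for i in range(data.shape[1])]
    return pd.DataFrame(data, columns=columns)


def slice_time_sketch(time_index, data, positions):
    """Select the same time positions from every series at once."""
    return time_index[positions], data[:, positions]


# Synthetic data with the same shape as the GunPoint column above.
time_index = np.arange(150)
data = np.random.default_rng(0).normal(size=(200, 150))

table = tabularise_sketch(data, "dim_0")
print(table.shape)        # (200, 150)

sub_index, sub_data = slice_time_sketch(time_index, data, np.arange(5, 10))
print(sub_index)          # [5 6 7 8 9]
print(sub_data.shape)     # (200, 5)
```

Neither function loops over individual series, which is where the speed advantage over the nested format would come from.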

Making only small adjustments to the existing sktime code base, we can use this new data container to run a TimeSeriesClassificationTree:

In [14]:
# Build a classification pipeline using this new data container
# https://alan-turing-institute.github.io/sktime/examples/02_classification_univariate.html#Composable-time-series-forest
X_train, X_test = train_test_split(X)

steps = [
    ('extract', RandomIntervalFeatureExtractor(n_intervals='sqrt',
                                               features=[np.mean, np.std, time_series_slope])),
    ('clf', DecisionTreeClassifier())
]

time_series_tree = Pipeline(steps)
time_series_tree.fit(X_train[['dim_0']], X_train[['class_val']])
time_series_tree.score(X_test[['dim_0']], X_test[['class_val']])

0.9
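Conceptually, the `extract` step computes summary features over randomly placed intervals of each series. The following NumPy sketch illustrates that idea under stated assumptions: the function names, the minimum interval length, and the reading of `'sqrt'` as the integer square root of the series length are all choices made for this illustration, not the behaviour of `RandomIntervalFeatureExtractor`.

```python
import numpy as np


def slope(y):
    """Least-squares slope of y against 0, 1, ..., len(y) - 1."""
    return np.polyfit(np.arange(len(y)), y, deg=1)[0]


def random_interval_features_sketch(data, n_intervals, min_length=3, seed=None):
    """Mean, std and slope over random intervals of a (n_series, n_timepoints) array."""
    rng = np.random.default_rng(seed)
    n_timepoints = data.shape[1]
    features = []
    for _ in range(n_intervals):
        # Draw an interval [start, end) with at least min_length points.
        start = rng.integers(0, n_timepoints - min_length + 1)
        end = rng.integers(start + min_length, n_timepoints + 1)
        window = data[:, start:end]
        features.append(window.mean(axis=1))
        features.append(window.std(axis=1))
        features.append(np.apply_along_axis(slope, 1, window))
    # One row per series, three feature columns per interval.
    return np.column_stack(features)


# Synthetic data with the same shape as the GunPoint column above.
data = np.random.default_rng(0).normal(size=(200, 150))
n_intervals = int(np.sqrt(data.shape[1]))  # assumed meaning of 'sqrt'
X_features = random_interval_features_sketch(data, n_intervals, seed=0)
print(X_features.shape)  # (200, 36)
```

The resulting 2D feature matrix is what a standard scikit-learn estimator such as `DecisionTreeClassifier` can consume directly.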