# Time Series Analysis with `sktime`

In this notebook, we analyze (multivariate) time series data with the 
[`sktime`](https://github.com/sktime/sktime) toolbox. 

In our context, we work with **event-based** time series data from Durst's printers. 
Such data has essentially four columns:

- `time` representing the time index,
- `printer_id` representing a specific printer,
- `sensor_id` representing a certain sensor/variable, and
- `signal_value` representing the value that sensor `sensor_id` of printer `printer_id` 
reported at `time`.
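
To make this layout concrete, here is a minimal sketch that builds such an event-based table from a few made-up rows. The values and the in-memory CSV text are illustrative assumptions, not real printer data.

```python
import io

import pandas as pd

# Illustrative rows only: one event (sensor reading) per line, in no particular order.
csv_text = """time,printer_id,sensor_id,signal_value
2020-09-24 15:23:23+02:00,565,20,42.5
2020-09-24 15:10:07+02:00,565,15,40.51
2020-09-24 15:23:23+02:00,565,15,40.09
"""

events = pd.read_csv(io.StringIO(csv_text))
print(events)
print(events.shape)  # (number of events, number of columns)
```

Each row is a single reading, so a given timestamp can appear several times, once per sensor that reported at that moment.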

---

Each CSV file stores the data of one printer. The CSV choice is arbitrary.

In the next cell we:

1. read a CSV file with an event-based time series as a `pandas.DataFrame` and store it in a 
variable called `df`,
2. print `df`'s first 10 entries/rows, and
3. print `df`'s shape (i.e., a pair containing the number of rows and columns).

In [1]:
import pandas as pd

# 1
df = pd.read_csv('/home/edu/Dropbox/Work/Bolzano/Durst/Data/printer_unordered_565.csv')
# 2
print(df.head(10))
# 3
print(df.shape)

                     time  printer_id  sensor_id  signal_value
0  2020-09-25 17:36:30+02         565         20         33.50
1  2020-09-24 15:46:15+02         565         20         40.80
2  2020-09-24 15:23:23+02         565         20         42.50
3  2020-09-24 15:24:41+02         565         20         42.00
4  2020-09-24 15:30:35+02         565         20         41.50
5  2020-09-24 15:10:07+02         565         15         40.51
6  2020-09-24 15:26:22+02         565         20         42.50
7  2020-09-24 15:26:35+02         565         15         40.09
8  2020-09-25 17:18:51+02         565         20         33.15
9  2020-09-24 15:32:10+02         565         20         42.00
(2152090, 4)


We must **pivot** this representation.

Pivoting means "opening" the row-based (long) representation into a column-wise (wide) one 
in which each `sensor_id` becomes a column and the rows are indexed 
by the pair (`printer_id`, `time`), which is a `MultiIndex`. 

The entry $(i,j)$ of the new DataFrame is the `signal_value` of `sensor_id` $= j$ 
at `time` $= i$ if it exists; otherwise, it is `NaN`. 

Moreover, the same combination of `printer_id`, `time`, and `sensor_id` can occur more than once, 
so we need to specify an aggregation function `aggfunc` that reduces such duplicates to a single value.
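
As a small illustration of both effects (missing entries and duplicates), the sketch below pivots a made-up table in which sensor 20 reports twice at the same timestamp and sensor 15 is silent at the second one. The data is invented for the example.

```python
import pandas as pd

events = pd.DataFrame({
    "time": ["2020-09-24 15:10:07", "2020-09-24 15:10:07",
             "2020-09-24 15:10:07", "2020-09-24 15:23:23"],
    "printer_id": [565, 565, 565, 565],
    "sensor_id": [15, 20, 20, 20],
    "signal_value": [40.0, 42.0, 44.0, 41.5],
})

wide = events.pivot_table(
    index=["printer_id", "time"],
    columns="sensor_id",
    values="signal_value",
    aggfunc="mean",  # the two readings of sensor 20 at 15:10:07 are averaged
)
print(wide)
```

The duplicated reading collapses to its mean (43.0), and the cell for sensor 15 at the second timestamp is `NaN` because no event exists for it.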

In [2]:
dfp = df.pivot_table(index=['printer_id', 'time'], columns='sensor_id', values='signal_value', aggfunc='mean')
print(dfp.head(10))
print(dfp.shape)
print(type(dfp.index))

sensor_id                           9     10    11    12    13    14    15   \
printer_id time                                                               
565        2018-10-02 10:53:51+02   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
           2018-10-02 10:53:56+02   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
           2018-10-02 11:34:03+02   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
           2018-10-02 11:37:21+02   NaN   NaN   NaN   NaN  50.0  50.0  50.0   
           2019-02-27 16:54:45+01  50.0  50.0  50.0  50.0  50.0  50.0  50.0   
           2019-07-17 15:44:43+02  50.0  50.0  50.0  50.0  50.0  50.0  50.0   
           2019-07-17 15:55:39+02   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
           2019-07-17 15:55:40+02   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
           2019-07-17 16:01:51+02   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
           2019-07-17 16:02:05+02   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

sensor_id                           16   17   18   

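Because we read the CSV without any date-parsing options, the second index level (`time`) still holds the timestamps as strings. We convert this level to a `DatetimeIndex`, passing `utc=True` so that the timezone offsets are parsed and all timestamps are normalized to UTC.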
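
The conversion can be tried on a toy index first. This is a minimal sketch with made-up timestamps; it uses the `level=` argument of `MultiIndex.set_levels` to replace only the `time` level, which is one of several equivalent ways to write the step.

```python
import pandas as pd

# Toy MultiIndex whose `time` level holds strings with a +02:00 offset.
idx = pd.MultiIndex.from_tuples(
    [(565, "2020-09-24 15:23:23+02:00"), (565, "2020-09-25 17:36:30+02:00")],
    names=["printer_id", "time"],
)
toy = pd.DataFrame({20: [42.5, 33.5]}, index=idx)

# Parse the strings of the `time` level and normalize them to UTC.
toy.index = toy.index.set_levels(
    pd.to_datetime(toy.index.levels[1], utc=True), level="time"
)
print(toy.index.get_level_values("time"))
```

After the conversion the timestamps are shifted by two hours and carry a `+00:00` offset, which matches what the notebook output shows for the real data.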

In [3]:
dfp.index = dfp.index.set_levels([dfp.index.levels[0], pd.to_datetime(dfp.index.levels[1], utc=True)])
print(dfp)
print(type(dfp.index))


sensor_id                              9     10    11    12    13    14   \
printer_id time                                                            
565        2018-10-02 08:53:51+00:00   NaN   NaN   NaN   NaN   NaN   NaN   
           2018-10-02 08:53:56+00:00   NaN   NaN   NaN   NaN   NaN   NaN   
           2018-10-02 09:34:03+00:00   NaN   NaN   NaN   NaN   NaN   NaN   
           2018-10-02 09:37:21+00:00   NaN   NaN   NaN   NaN  50.0  50.0   
           2019-02-27 15:54:45+00:00  50.0  50.0  50.0  50.0  50.0  50.0   
...                                    ...   ...   ...   ...   ...   ...   
           2021-10-04 11:32:10+00:00   NaN   NaN   NaN   NaN   NaN   NaN   
           2021-10-04 11:32:11+00:00   NaN   NaN   NaN   NaN   NaN   NaN   
           2021-10-04 11:35:31+00:00   NaN   NaN   NaN   NaN   NaN   NaN   
           2021-10-04 11:37:11+00:00   NaN   NaN   NaN   NaN   NaN   NaN   
           2021-10-04 11:38:44+00:00   NaN   NaN   NaN   NaN   NaN   NaN   

sensor_id  

In the next code cell, we `resample` the data points monthly (`M`; pandas 2.2 and later spell this alias `ME`) along the `time` level and aggregate the values of each month by computing their `mean`.

Observe that after this operation `printer_id` is lost from the index, so we must restore the `MultiIndex`.
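
The loss of the `printer_id` level is easy to reproduce on a toy frame. The sketch below is an assumption-laden miniature: it uses made-up values and daily resampling (`'D'`) instead of monthly, only to keep the example small.

```python
import pandas as pd

times = pd.to_datetime(
    ["2020-09-24 10:00", "2020-09-24 14:00", "2020-09-25 09:00"], utc=True
)
toy = pd.DataFrame(
    {20: [1.0, 3.0, 5.0]},
    index=pd.MultiIndex.from_product([[565], times], names=["printer_id", "time"]),
)

# Resampling on one level returns a frame indexed by that level only.
daily = toy.resample("D", level="time").mean()
n_levels_after_resample = daily.index.nlevels
print(type(daily.index).__name__, n_levels_after_resample)

# Restore the two-level index by pairing the printer id with every time bin.
daily.index = pd.MultiIndex.from_product(
    [[565], daily.index], names=["printer_id", "time"]
)
print(daily)
```

Passing `daily.index` itself (rather than its `.values`) keeps the UTC timezone on the restored `time` level.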

In [4]:
dfr = dfp.resample('M', level='time').mean()

print(dfr.index)

# Note: `dfr.index.values` yields timezone-naive values (in UTC), and the printer id is stored as a string.
dfr.set_index(pd.MultiIndex.from_product([["565"], dfr.index.values], names=["printer_id", "time"]), inplace=True)

print(dfr.index)

dfr.head()

DatetimeIndex(['2018-10-31 00:00:00+00:00', '2018-11-30 00:00:00+00:00',
               '2018-12-31 00:00:00+00:00', '2019-01-31 00:00:00+00:00',
               '2019-02-28 00:00:00+00:00', '2019-03-31 00:00:00+00:00',
               '2019-04-30 00:00:00+00:00', '2019-05-31 00:00:00+00:00',
               '2019-06-30 00:00:00+00:00', '2019-07-31 00:00:00+00:00',
               '2019-08-31 00:00:00+00:00', '2019-09-30 00:00:00+00:00',
               '2019-10-31 00:00:00+00:00', '2019-11-30 00:00:00+00:00',
               '2019-12-31 00:00:00+00:00', '2020-01-31 00:00:00+00:00',
               '2020-02-29 00:00:00+00:00', '2020-03-31 00:00:00+00:00',
               '2020-04-30 00:00:00+00:00', '2020-05-31 00:00:00+00:00',
               '2020-06-30 00:00:00+00:00', '2020-07-31 00:00:00+00:00',
               '2020-08-31 00:00:00+00:00', '2020-09-30 00:00:00+00:00',
               '2020-10-31 00:00:00+00:00', '2020-11-30 00:00:00+00:00',
               '2020-12-31 00:00:00+00:00', '2021-0

printer_id,time,9,10,11,12,13,14,15,16,17,18,...,663,664,665,666,667,668,669,670,671,672
565,2018-10-31,,,,,50.0,50.0,50.0,50.0,,,...,,,,,,,,,,
565,2018-11-30,,,,,,,,,,,...,,,,,,,,,,
565,2018-12-31,,,,,,,,,,,...,,,,,,,,,,
565,2019-01-31,,,,,,,,,,,...,,,,,,,,,,
565,2019-02-28,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,,,...,,,,,0.0,32.0,,1.0,,


---

## Automating the workflow

Putting it all together, we can define a function that reads several CSV files and builds a single dataset with one multivariate time series per printer, also known as panel data.

In [1]:
# import numpy as np
import pandas as pd

printers = [565, 574, 628, 679, 686]


def build_panel(printers, freq='M', pivot_aggfn='mean'):
    """Build panel data: one resampled multivariate time series per printer."""
    frames = []
    for printer_id in printers:
        # 1. read csv
        df = pd.read_csv(f'/home/edu/Dropbox/Work/Bolzano/Durst/Data/printer_unordered_{printer_id}.csv')
        # 2. pivot table
        dfp = df.pivot_table(index=['printer_id', 'time'], columns='sensor_id', values='signal_value', aggfunc=pivot_aggfn)
        # 3. convert time index to datetime
        dfp.index = dfp.index.set_levels([dfp.index.levels[0], pd.to_datetime(dfp.index.levels[1], utc=True)])
        # 4. resample (aggregating each bin with the mean)
        dfr = dfp.resample(freq, level='time').mean()
        # 5. set multiindex (the printer id is stored as a string)
        dfr.set_index(pd.MultiIndex.from_product([[f'{printer_id}'], dfr.index.values], names=["printer_id", "time"]), inplace=True)
        # 6. collect the result
        frames.append(dfr)

    # concatenate all printers once, instead of growing a DataFrame inside the loop
    return pd.concat(frames)

panel = build_panel(printers)
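
Since the CSV files above live on a local path, the workflow can also be exercised on synthetic data. The following is a sketch under stated assumptions: `build_panel_from_dir` and its `data_dir` parameter are hypothetical (not part of the notebook), the CSV contents are invented, the pivot is done on `time` only, and daily resampling is used to keep the toy data small.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_panel_from_dir(data_dir, printers, freq="D"):
    """Hypothetical variant of build_panel that takes the data directory as an argument."""
    frames = {}
    for printer_id in printers:
        df = pd.read_csv(Path(data_dir) / f"printer_unordered_{printer_id}.csv")
        df["time"] = pd.to_datetime(df["time"], utc=True)
        wide = df.pivot_table(
            index="time", columns="sensor_id", values="signal_value", aggfunc="mean"
        )
        frames[str(printer_id)] = wide.resample(freq).mean()
    # The dict keys become the outer (printer_id) level of the MultiIndex.
    return pd.concat(frames, names=["printer_id", "time"])


with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny, made-up CSV files in the same layout as the real ones.
    for pid, value in [(565, 1.0), (574, 2.0)]:
        pd.DataFrame({
            "time": ["2020-09-24 10:00:00+02:00", "2020-09-24 12:00:00+02:00"],
            "printer_id": pid,
            "sensor_id": [20, 20],
            "signal_value": [value, value + 2.0],
        }).to_csv(Path(tmp) / f"printer_unordered_{pid}.csv", index=False)

    toy_panel = build_panel_from_dir(tmp, [565, 574])

print(toy_panel)
```

Using `pd.concat` with a dictionary builds the `printer_id` level directly from the keys, so the index does not have to be reconstructed by hand after resampling.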

We can inspect a particular instance/multivariate time series.

In [9]:
# https://stackoverflow.com/questions/53927460/select-rows-in-pandas-multiindex-dataframe
# panel.loc[['565']]

print([idx for idx in panel.index.levels[0]])

panel.xs('565', level=0, axis=0, drop_level=False).shape

['565', '574', '628', '679', '686']


(37, 118)
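
The difference between the selection styles is easiest to see on a toy panel. This is a minimal sketch with invented values; it contrasts `xs` with `drop_level=False` against the two forms of `.loc`.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["565", "574"], pd.date_range("2020-01-31", periods=3, freq="D", tz="UTC")],
    names=["printer_id", "time"],
)
toy_panel = pd.DataFrame({9: range(6), 10: range(6, 12)}, index=idx)

# Keeps both index levels in the result.
one = toy_panel.xs("565", level="printer_id", drop_level=False)
# A list selector also keeps both levels.
same = toy_panel.loc[["565"]]
# A scalar selector drops the printer_id level.
time_only = toy_panel.loc["565"]

print(one.shape, same.index.nlevels, time_only.index.nlevels)
```

Keeping the `printer_id` level is convenient when the selected instance is later passed to code that expects the panel layout.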

---

## Working with `sktime`

The package provides many so-called **estimators**, which we can list with `all_estimators`. 

In [6]:
from sktime.registry import all_estimators

all_estimators(as_dataframe=True)

,name,object
0,ARDL,<class 'sktime.forecasting.ardl.ARDL'>
1,ARIMA,<class 'sktime.forecasting.arima.ARIMA'>
2,AggrDist,<class 'sktime.dists_kernels.compose_tab_to_pa...
3,Aggregator,<class 'sktime.transformations.hierarchical.ag...
4,AlignerDTW,<class 'sktime.alignment.dtw_python.AlignerDTW'>
...,...,...
324,WeightedEnsembleClassifier,<class 'sktime.classification.ensemble._weight...
325,WhiteNoiseAugmenter,<class 'sktime.transformations.series.augmente...
326,WindowSummarizer,<class 'sktime.transformations.series.summariz...
327,YfromX,<class 'sktime.forecasting.compose._reduce.Yfr...


Observe that the list is quite long, so we need a better way of viewing it. 

We can restrict it, for example, to the **transformer**s only. The next cell does so and shows the tags of the first transformer that is returned.

In [17]:
# all_estimators returns (name, class) pairs; show the tags of the first transformer
dict(all_estimators('transformer')[0][1]._tags)

{'scitype:transform-input': 'Series',
 'scitype:transform-output': 'Series',
 'scitype:transform-labels': 'None',
 'scitype:instancewise': True,
 'X_inner_mtype': ['pd.Series',
  'pd.DataFrame',
  'pd-multiindex',
  'pd_multiindex_hier'],
 'y_inner_mtype': 'None',
 'capability:inverse_transform': False,
 'skip-inverse-transform': True,
 'univariate-only': False,
 'handles-missing-data': False,
 'X-y-must-have-same-index': False,
 'fit_is_empty': True,
 'transform-returns-same-time-index': False}

`sktime` also offers a function for inspecting all available **tags**.

In [1]:
from sktime.registry import all_tags

all_tags(as_dataframe=True)

,name,scitype,type,description
0,X-y-must-have-same-index,"[forecaster, regressor]",bool,do X/y in fit/update and X/fh in predict have ...
1,X_inner_mtype,"[clusterer, forecaster, transformer, transform...","(list, [pd.Series, pd.DataFrame, np.array, nes...",which machine type(s) is the internal _fit/_pr...
2,alignment_type,aligner,"(str, [full, partial])",does aligner produce a full or partial alignment
3,approx_energy_spl,distribution,int,sample size used in approximating generative e...
4,approx_mean_spl,distribution,int,sample size used in approximating generative m...
...,...,...,...,...
57,symmetric,"[transformer-pairwise, transformer-pairwise-pa...",bool,"is the transformer symmetric, i.e., t(x,y)=t(y..."
58,transform-returns-same-time-index,transformer,bool,does transform return same time index as input?
59,univariate-metric,metric,bool,Does the metric only work on univariate y data?
60,univariate-only,transformer,bool,can transformer handle multivariate series? Tr...


We can use such tags to filter the estimators.

In [2]:
# the kernel was restarted, so all_estimators must be imported again
from sktime.registry import all_estimators

all_estimators(
    'transformer',
    filter_tags={'univariate-only': True},
    return_names=False,
)

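As an example, we chain an `Imputer` and a `PAA` transformer into a pipeline with the `*` operator, and then inspect the tags of `Imputer`.

In [4]: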
from sktime.transformations.series.impute import Imputer
from sktime.transformations.panel.dictionary_based import PAA
# from sktime.transformations.series.summarize import SummaryTransformer

import warnings
warnings.filterwarnings('ignore')

imputer = Imputer()
paa = PAA(4)

pipe = imputer * paa

# pipe.fit(panel)

# pipe.transform(panel)

Imputer()._tags

{'scitype:transform-input': 'Series',
 'scitype:transform-output': 'Series',
 'scitype:instancewise': True,
 'X_inner_mtype': ['pd.DataFrame'],
 'y_inner_mtype': 'None',
 'fit_is_empty': False,
 'handles-missing-data': True,
 'skip-inverse-transform': True,
 'capability:inverse_transform': True,
 'univariate-only': False,
 'capability:missing_values:removes': True,
 'remember_data': False}

In [33]:
import inspect
from sktime.transformations.series.impute import Imputer

# list the constructor's parameter names and their defaults
params = inspect.signature(Imputer.__init__).parameters
print(list(params.keys()))
print(list(params.values()))

['self', 'method', 'random_state', 'value', 'forecaster', 'missing_values']
[<Parameter "self">, <Parameter "method='drift'">, <Parameter "random_state=None">, <Parameter "value=None">, <Parameter "forecaster=None">, <Parameter "missing_values=None">]
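
The same introspection can collect the default values into a dictionary. The sketch below uses a hypothetical stand-in class, `ToyImputer`, which is not part of `sktime`, so that it runs with the standard library alone.

```python
import inspect


class ToyImputer:
    """Hypothetical stand-in for an estimator; not part of sktime."""

    def __init__(self, method="drift", value=None):
        self.method = method
        self.value = value


params = inspect.signature(ToyImputer.__init__).parameters

# Keep only the parameters that declare a default (this skips `self`).
defaults = {
    name: p.default
    for name, p in params.items()
    if p.default is not inspect.Parameter.empty
}
print(defaults)
```

Applied to a real estimator class, such a dictionary gives a quick overview of the hyperparameters that can be tuned.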
