In [None]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

import numpy as np
import pandas as pd
import os
import sys
import requests

sys.path.insert(0, '..')

from transformers.data_container import *
from transformers.transformers import TimeSeriesTransformer, MeanSeriesTransformer, TimeSeriesWindowTransformer, Ts

## Multiseries

This usage example is based on the open-source FordA time series [data set](http://timeseriesclassification.com/Downloads/FordA.zip).

First, we need to read the data. Here, we use the `urlopen` function from Python's built-in `urllib` to download the data set, then skip the first `series_offset` lines of the file so that only the series remain.

In [1]:
url = "http://timeseriesclassification.com/Downloads/FordA.zip"
series_offset = 505 # offset for the beginning of the series

In [3]:
response = urlopen(url)  # keep `url` intact so the cell can be re-run
zipfile = ZipFile(BytesIO(response.read()))
lines = zipfile.open('FordA/FordA.csv').readlines()
lines = [l.decode('utf-8') for l in lines]
lines = lines[series_offset:]
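
The same read-and-skip pattern can be tried offline with a small in-memory archive. The sketch below is an illustration only: the member name `demo/demo.csv`, the helper `read_series_lines`, and the sample rows are made up and are not part of the FordA data set. It uses context managers so that the archive and the member file are closed after reading.

```python
from io import BytesIO
from zipfile import ZipFile


def read_series_lines(raw_zip_bytes, member, offset=0):
    """Return the decoded lines of `member`, skipping the first `offset` lines."""
    with ZipFile(BytesIO(raw_zip_bytes)) as archive:
        with archive.open(member) as handle:
            decoded = [line.decode('utf-8') for line in handle.readlines()]
    return decoded[offset:]


# Build a tiny stand-in archive; these rows are NOT real FordA data.
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('demo/demo.csv', 'header\n1.0,2.0,3.0\n4.0,5.0,6.0\n')

demo_lines = read_series_lines(buffer.getvalue(), 'demo/demo.csv', offset=1)
# float() ignores the trailing newline, so no explicit strip is needed.
demo_values = [list(map(float, line.split(','))) for line in demo_lines]
print(demo_values)
```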

``lines`` is now a list of strings, each representing a time series in comma-separated format, which we can convert into lists of floats.

In [4]:
lines = [list(map(float, l.split(','))) for l in lines]

In [5]:
lines[0][:10]

[1.1871,
 0.4096,
 -0.43154,
 -1.231,
 -1.9055,
 -2.3824,
 -2.588,
 -2.5018,
 -2.1353,
 -1.574]

Let's convert each nested list into a more convenient ``pandas.Series`` object.

In [13]:
lines = [pd.Series(l) for l in lines]

In [14]:
lines[0][:10]

0    1.18710
1    0.40960
2   -0.43154
3   -1.23100
4   -1.90550
5   -2.38240
6   -2.58800
7   -2.50180
8   -2.13530
9   -1.57400
dtype: float64

We can then wrap the list of ``pandas.Series`` objects in a ``MultiSeries`` object.

In [15]:
X = MultiSeries(lines)

The object has a global index over the series and a sub-index within each ``pandas.Series``.

In [16]:
X.head()

0    0      1.187100
1      0.409600
2     -0.43154...
1    0      0.094261
1      0.310310
2      0.53060...
2    0     -1.157000
1     -1.592600
2     -1.50960...
3    0      0.356960
1      0.300850
2      0.24314...
4    0      0.307980
1      0.370350
2      0.26015...
dtype: object
data_type: <class 'pandas.core.series.Series'>

The output reveals the ``data_type`` property of the ``MultiSeries`` object, which holds the type of the contained objects, in this case ``pandas.Series``. The ``MultiSeries`` is thus built up of pandas series. In particular, ``X`` supports all methods of ``pd.Series``.
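
The two-level indexing can be mimicked with plain pandas. The sketch below is an analogy under the assumption that a container of series behaves like an object-dtype ``pandas.Series`` whose elements are themselves ``pandas.Series``; it does not use or reproduce the ``MultiSeries`` class, and the toy values are made up.

```python
import pandas as pd

# A plain pandas stand-in for a container of series (not the MultiSeries class).
nested = pd.Series([
    pd.Series([1.0, 2.0, 3.0]),
    pd.Series([4.0, 5.0, 6.0]),
])

# The outer index selects a series; the inner index selects a value in it.
first_series = nested[0]
value = nested[1][2]

# Applying a function over the outer container visits each inner series.
means = nested.apply(lambda s: s.mean())
print(nested.dtype, value, means.tolist())
```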

## Transformers

A major motivation for this project is the common data science task of extracting features from complex objects (for example, series) before proceeding with machine learning.

Given a MultiSeries of pandas.Series one would, for instance, like to extract features from each series. That's where *Transformers* play a vital role. Let's explore some examples.

### TimeSeriesWindowTransformer

This transformer calculates a moving average with a given window size.

In [17]:
tr = TimeSeriesWindowTransformer(windows_size=5)
tr.fit(X)
transformed_series = tr.transform(X)

In [11]:
transformed_series.head()

0    0           NaN
1           NaN
2           Na...
1    0           NaN
1           NaN
2           Na...
2    0           NaN
1           NaN
2           Na...
3    0           NaN
1           NaN
2           Na...
4    0           NaN
1           NaN
2           Na...
dtype: object
data_type: <class 'pandas.core.series.Series'>

As expected, with ``windows_size=5`` the first 4 elements of each series are NaN, because a full window is not yet available.

In [18]:
transformed_series[0].head(10)

0         NaN
1         NaN
2         NaN
3         NaN
4   -0.394268
5   -1.108168
6   -1.707688
7   -2.121740
8   -2.302600
9   -2.236300
dtype: float64
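
For comparison, a trailing rolling mean in plain pandas gives the same numbers for this sample. The sketch below applies ``Series.rolling(window=5).mean()`` to the first ten values of the first series, copied from the output shown earlier. This suggests, but does not confirm, that the transformer computes a trailing mean over the window.

```python
import pandas as pd

# First ten values of the first series, copied from the earlier output.
values = pd.Series([1.1871, 0.4096, -0.43154, -1.231, -1.9055,
                    -2.3824, -2.588, -2.5018, -2.1353, -1.574])

# Trailing window of 5: the first 4 positions lack a full window and are NaN.
rolled = values.rolling(window=5).mean()
print(rolled.round(6).tolist())
```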

### TimeSeriesTransformer

Let's try another transformer, probably the most common one. It extracts several quantitative features from each ``pandas.Series``, such as the mean, standard deviation, and quantiles. You can also pass your own list of features. As a result we retrieve a ``MultiDataFrame`` object.

In [19]:
tr = TimeSeriesTransformer()
tr.fit(X)
transformed_series = tr.transform(X)

In [20]:
type(transformed_series)

transformers.data_container.data_container.MultiDataFrame

In [21]:
transformed_series.head()

|   | max | mean | median | min | quantile_25 | quantile_75 | quantile_90 | quantile_95 | std |
|---|-----|------|--------|-----|-------------|-------------|-------------|-------------|-----|
| 0 | 2.5263 | 0.001995 | 0.011186 | -2.7875 | -0.73635 | 0.74192 | 1.2534 | 1.5463 | 0.999998 |
| 1 | 2.6291 | 0.001997 | -0.024726 | -2.4357 | -0.67411 | 0.65808 | 1.3478 | 1.6595 | 0.999997 |
| 2 | 2.6072 | -0.001996 | 0.060685 | -3.0132 | -0.67588 | 0.70123 | 1.2591 | 1.5184 | 1.0 |
| 3 | 2.6431 | -0.001997 | -0.022668 | -2.7275 | -0.66265 | 0.56858 | 1.4102 | 1.8094 | 0.999997 |
| 4 | 3.2398 | -0.001995 | -0.048518 | -3.0085 | -0.70775 | 0.64898 | 1.254 | 1.6699 | 1.000001 |
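
To make the idea of per-series feature extraction concrete, here is a minimal sketch in plain pandas. It is not the implementation of ``TimeSeriesTransformer``: the helper ``summarize`` is hypothetical, the input is toy data, and the statistics use pandas defaults (for example, the sample standard deviation), which may differ from what the transformer uses.

```python
import pandas as pd


def summarize(series):
    """Collect a few summary statistics of one series into a dict."""
    return {
        'max': series.max(),
        'mean': series.mean(),
        'median': series.median(),
        'min': series.min(),
        'quantile_25': series.quantile(0.25),
        'quantile_75': series.quantile(0.75),
        'std': series.std(),
    }


# Toy input, not FordA data.
toy = [
    pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]),
    pd.Series([10.0, 20.0, 30.0, 40.0, 50.0]),
]

# One row per series, one column per feature.
features = pd.DataFrame([summarize(s) for s in toy])
print(features)
```

Each input series becomes one row, so a collection of variable-length series turns into a fixed-width table.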


``MultiDataFrame`` is an abstract container that is based on ``pandas.DataFrame`` and stores ``MultiSeries`` objects.

The main feature of the ``MultiDataFrame`` is its columns of ``MultiSeries``, which can contain and manage any **data_type**. For example, one may have a data set consisting of series, images, texts, plain numbers, or even custom objects. Ideally, we would want to handle such different data types in a unified 2D data container and, for example, apply a chain of transformers to it to create a simple 2D matrix of training data.

The following example illustrates such a ``MultiDataFrame`` workflow.

Let ``Y`` be a vector of labels for each row.

In [16]:
Y = np.random.binomial(1, 0.5, X.shape[0])
Y = MultiSeries(Y)

In [17]:
df = MultiDataFrame({
    'series': X,
    'labels': Y
})

In [18]:
df.head()

|   | labels | series |
|---|--------|--------|
| 0 | 0 | 0 1.187100 1 0.409600 2 -0.43154... |
| 1 | 0 | 0 0.094261 1 0.310310 2 0.53060... |
| 2 | 1 | 0 -1.157000 1 -1.592600 2 -1.50960... |
| 3 | 1 | 0 0.356960 1 0.300850 2 0.24314... |
| 4 | 1 | 0 0.307980 1 0.370350 2 0.26015... |


In [22]:
from transformers.transformers import PipeLineChain
from sklearn.decomposition import PCA

In [23]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(window=3)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5))
])
chain.fit(X)

<transformers.transformers.pipeline_transformer.pipeline_transformer.PipeLineChain at 0x11caa9400>

In [24]:
transformed_X = chain.transform(X)

In [28]:
transformed_X.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -0.075679 | -0.211572 | -0.100136 | 0.001429 | -0.002408 |
| 1 | -0.186301 | 0.102065 | 0.086118 | -0.022691 | 0.016804 |
| 2 | 0.107721 | -0.218586 | -0.138592 | 0.09428 | 0.018337 |
| 3 | 0.017658 | -0.060773 | 0.220791 | -0.101369 | 0.033387 |
| 4 | 0.454651 | 0.149669 | -0.046983 | -0.061386 | -0.009684 |


The resulting data can be readily used in downstream models, e.g. as input to a scikit-learn estimator.
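
As a rough illustration of that last step, the sketch below feeds a numeric feature matrix through PCA into a scikit-learn classifier. Everything in it is an assumption for demonstration purposes: the feature matrix is synthetic random data standing in for extracted features, the labels are random (like ``Y`` earlier), and ``LogisticRegression`` is an arbitrary choice of estimator. With random labels the model learns nothing meaningful; the point is only the shape of the workflow.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for an extracted feature matrix: 100 rows, 9 features.
features = rng.normal(size=(100, 9))
# Random binary labels, one per row.
labels = rng.integers(0, 2, size=100)

# Reduce to 5 components, then fit a classifier on the reduced matrix.
model = Pipeline([
    ('pca', PCA(n_components=5)),
    ('classifier', LogisticRegression()),
])
model.fit(features, labels)
predictions = model.predict(features)
print(predictions.shape)
```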