In [1]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

import numpy as np
import pandas as pd
import os, sys
import requests

sys.path.insert(0, '..')

from transformers.data_container import *
from transformers.transformers import TimeSeriesTransformer, MeanSeriesTransformer, TimeSeriesWindowTransformer



The usage example below is based on an open-source time series [data set](http://timeseriesclassification.com/Downloads/FordA.zip).

The first thing to do is read the data. We use the `urlopen` function from Python's built-in `urllib` package to download the data set into memory. We also drop the first `series_offset` lines of the file, which precede the series.

In [2]:
url = "http://timeseriesclassification.com/Downloads/FordA.zip"
series_offset = 505

In [3]:
url = urlopen(url)
zipfile = ZipFile(BytesIO(url.read()))
lines = zipfile.open('FordA/FordA.csv').readlines()
lines = [l.decode('utf-8') for l in lines]
lines = lines[series_offset:]

**lines** is now a list of strings, each holding one time series in comma-separated format.

505 is the line offset at which the series begin.
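The same in-memory reading pattern can be tried offline. The sketch below is only an illustration: the archive contents, the file name `demo/demo.csv`, and the two-line header are made up and are not taken from the FordA data set.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small zip archive in memory (hypothetical contents, for illustration only).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo/demo.csv",
        "header line 1\nheader line 2\n1.0,2.0,3.0\n4.0,5.0,6.0\n",
    )

header_offset = 2  # number of leading lines to skip (assumed for this toy file)

# Read the archive back from raw bytes, the same way downloaded bytes are handled.
with ZipFile(BytesIO(buffer.getvalue())) as archive:
    raw_lines = archive.open("demo/demo.csv").readlines()

# Decode bytes to text, drop the header, and parse each line into floats.
decoded = [line.decode("utf-8") for line in raw_lines][header_offset:]
rows = [[float(value) for value in line.split(",")] for line in decoded]
print(rows)
```

Nothing is written to disk here: `BytesIO` wraps the bytes so that `ZipFile` can treat them as a file.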

In [4]:
lines = [list(map(float, l.split(','))) for l in lines]

In [5]:
lines[0][:10]

[1.1871,
 0.4096,
 -0.43154,
 -1.231,
 -1.9055,
 -2.3824,
 -2.588,
 -2.5018,
 -2.1353,
 -1.574]

Now **lines** is a list of lists of floats. Let's convert each inner list into a more convenient pandas.Series object.

In [6]:
lines = [pd.Series(l) for l in lines]

In [7]:
lines[0][:10]

0    1.18710
1    0.40960
2   -0.43154
3   -1.23100
4   -1.90550
5   -2.38240
6   -2.58800
7   -2.50180
8   -2.13530
9   -1.57400
dtype: float64

Now we have a list of pandas.Series objects. Next, we wrap the list in another series called MultiSeries. Thus the list of lists of floats becomes a MultiSeries of pandas.Series objects.

The MultiSeries has a global index, and each pandas.Series has its own index.
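The "series of series" idea can be illustrated with plain pandas. This is a rough sketch of the concept only, not the MultiSeries implementation; the toy values and the index labels `"a"` and `"b"` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

inner = [pd.Series([1.0, 2.0, 3.0]), pd.Series([10.0, 20.0])]

# Fill an object array element by element so NumPy keeps each Series intact.
cells = np.empty(len(inner), dtype=object)
for position, series in enumerate(inner):
    cells[position] = series

# The outer Series carries the global index; each cell is itself a Series.
outer = pd.Series(cells, index=["a", "b"])

first = outer["a"]          # a pandas.Series with its own index 0, 1, 2
lengths = outer.apply(len)  # ordinary Series methods still work on the container
print(type(first).__name__, list(first.index), lengths.to_dict())
```

The inner series may have different lengths, because the outer container only stores references to them.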

In [8]:
X = MultiSeries(lines)

In [9]:
X.head()

0    0      1.187100
1      0.409600
2     -0.43154...
1    0      0.094261
1      0.310310
2      0.53060...
2    0     -1.157000
1     -1.592600
2     -1.50960...
3    0      0.356960
1      0.300850
2      0.24314...
4    0      0.307980
1      0.370350
2      0.26015...
dtype: object
data_type: <class 'pandas.core.series.Series'>

The output might seem a bit messy. It prints the MultiSeries of pandas.Series followed by the **data_type** property, which shows the type of the underlying data stored in the MultiSeries.

**X** is now a MultiSeries of pd.Series objects: every element of this MultiSeries is a pd.Series.

**X** supports all the methods that a regular pd.Series does.

## Transformers

A common task in data science, and the motivation for this project, is to extract features from complex objects (for example, series) and then do some fancy machine learning.

Given a MultiSeries of pandas.Series, one would like to extract features from each Series. That's where Transformers come in. Let's try an example.

The first simple transformer is *TimeSeriesWindowTransformer*. It calculates a moving average with a given window size.

In [10]:
tr = TimeSeriesWindowTransformer(windows_size=5)
tr.fit(X)
transformed_series = tr.transform(X)

In [11]:
transformed_series.head()

0    0           NaN
1           NaN
2           Na...
1    0           NaN
1           NaN
2           Na...
2    0           NaN
1           NaN
2           Na...
3    0           NaN
1           NaN
2           Na...
4    0           NaN
1           NaN
2           Na...
dtype: object
data_type: <class 'pandas.core.series.Series'>

As expected, with windows_size = 5 the first 4 elements are NaN, because a full window is not yet available at those positions.
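The same numbers can be reproduced with plain pandas. The sketch below applies `Series.rolling` to the first ten values of the first series shown earlier; whether the transformer uses `rolling` internally is an assumption, but the results agree with the output that follows.

```python
import pandas as pd

# First ten values of the first series, copied from the earlier output.
values = pd.Series([1.1871, 0.4096, -0.43154, -1.231, -1.9055,
                    -2.3824, -2.588, -2.5018, -2.1353, -1.574])

window_size = 5
moving_average = values.rolling(window=window_size).mean()

# The first window_size - 1 positions have too few observations, so they are NaN.
print(moving_average.isna().sum())   # 4
print(round(moving_average[4], 6))   # -0.394268
```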

In [12]:
transformed_series[0].head(10)

0         NaN
1         NaN
2         NaN
3         NaN
4   -0.394268
5   -1.108168
6   -1.707688
7   -2.121740
8   -2.302600
9   -2.236300
dtype: float64

Let's try another transformer, probably the most common one. It extracts several quantitative features from each pandas.Series, such as the mean, standard deviation, and quantiles. You can pass your own list of features. The result is a **MultiDataFrame** object.

In [13]:
tr = TimeSeriesTransformer()
tr.fit(X)
transformed_series = tr.transform(X)

In [14]:
type(transformed_series)

transformers.data_container.data_container.MultiDataFrame

In [15]:
transformed_series.head()

| | max | mean | median | min | quantile_25 | quantile_75 | quantile_90 | quantile_95 | std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.5263 | 0.001995 | 0.011186 | -2.7875 | -0.73635 | 0.74192 | 1.2534 | 1.5463 | 0.999998 |
| 1 | 2.6291 | 0.001997 | -0.024726 | -2.4357 | -0.67411 | 0.65808 | 1.3478 | 1.6595 | 0.999997 |
| 2 | 2.6072 | -0.001996 | 0.060685 | -3.0132 | -0.67588 | 0.70123 | 1.2591 | 1.5184 | 1.0 |
| 3 | 2.6431 | -0.001997 | -0.022668 | -2.7275 | -0.66265 | 0.56858 | 1.4102 | 1.8094 | 0.999997 |
| 4 | 3.2398 | -0.001995 | -0.048518 | -3.0085 | -0.70775 | 0.64898 | 1.254 | 1.6699 | 1.000001 |
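To make the feature table less of a black box, here is a minimal sketch of per-series feature extraction in plain pandas. The helper `summarize` and the toy series are hypothetical; the column names mirror some of those in the table above, but this is not the TimeSeriesTransformer code.

```python
import pandas as pd


def summarize(series):
    """Return a few summary statistics of one series as a dict (hypothetical helper)."""
    return {
        "max": series.max(),
        "mean": series.mean(),
        "median": series.median(),
        "min": series.min(),
        "quantile_25": series.quantile(0.25),
        "quantile_75": series.quantile(0.75),
        "std": series.std(),
    }


toy_series = [
    pd.Series([1.0, 2.0, 3.0, 4.0]),
    pd.Series([0.0, 10.0, 20.0, 30.0]),
]

# One row per input series, one column per feature.
features = pd.DataFrame([summarize(s) for s in toy_series])
print(features)
```

Each input series collapses to a single row, so a collection of variable-length series becomes a fixed-width numeric table.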


**MultiDataFrame** is an abstract container based on pandas.DataFrame that can store **MultiSeries** objects.

The main feature of **MultiDataFrame** is that its columns are **MultiSeries** of any **data_type**, holding ANY objects. For example, one may have a data set of series, images, texts, plain numbers, or any custom objects. Imagine a 2D data container that stores all of these. One would then like to write a chain of transformers that turns it into a simple 2D numeric matrix ready for an sklearn predictor.

Let's see an example. First, let's create a **MultiDataFrame**.

Let **Y** be the labels for each row (generated at random here, for illustration).

In [16]:
Y = np.random.binomial(1, 0.5, X.shape[0])
Y = MultiSeries(Y)

In [17]:
df = MultiDataFrame({
    'series': X,
    'labels': Y
})

In [18]:
df.head()

| | labels | series |
|---|---|---|
| 0 | 0 | 0 1.187100 1 0.409600 2 -0.43154... |
| 1 | 0 | 0 0.094261 1 0.310310 2 0.53060... |
| 2 | 1 | 0 -1.157000 1 -1.592600 2 -1.50960... |
| 3 | 1 | 0 0.356960 1 0.300850 2 0.24314... |
| 4 | 1 | 0 0.307980 1 0.370350 2 0.26015... |


In [19]:
from transformers.transformers import PipeLineChain
from sklearn.decomposition import PCA

In [20]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(window=3)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5))
])
chain.fit(X)

<transformers.transformers.pipeline_transformer.pipeline_transformer.PipeLineChain at 0x117659978>

In [21]:
transformed_X = chain.transform(X)

In [23]:
transformed_X.head()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -0.075679 | -0.211572 | -0.100136 | 0.001429 | -0.002408 |
| 1 | -0.186301 | 0.102065 | 0.086118 | -0.022691 | 0.016804 |
| 2 | 0.107721 | -0.218586 | -0.138592 | 0.09428 | 0.018337 |
| 3 | 0.017658 | -0.060773 | 0.220791 | -0.101369 | 0.033387 |
| 4 | 0.454651 | 0.149669 | -0.046983 | -0.061386 | -0.009684 |


Now this data set is ready for plain sklearn estimators!
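The chaining pattern used above can be sketched with a few lines of plain Python. The classes `RollingMean`, `SummaryFeatures`, and `SimpleChain` below are hypothetical stand-ins written for this illustration; they only mimic the fit/transform protocol and are not the library's PipeLineChain. The PCA step is left out to keep the sketch free of extra dependencies.

```python
import numpy as np
import pandas as pd


class RollingMean:
    """Smooth every series with a moving average (hypothetical helper)."""

    def __init__(self, window):
        self.window = window

    def fit(self, data):
        return self  # nothing to learn

    def transform(self, data):
        # Drop the leading NaN values produced by incomplete windows.
        return [s.rolling(window=self.window).mean().dropna() for s in data]


class SummaryFeatures:
    """Turn every series into one row of summary statistics (hypothetical helper)."""

    def fit(self, data):
        return self

    def transform(self, data):
        return pd.DataFrame({
            "mean": [s.mean() for s in data],
            "std": [s.std() for s in data],
            "max": [s.max() for s in data],
        })


class SimpleChain:
    """Apply named steps in order, feeding each output to the next step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, data):
        for _, step in self.steps:
            data = step.fit(data).transform(data)
        return self

    def transform(self, data):
        for _, step in self.steps:
            data = step.transform(data)
        return data


rng = np.random.default_rng(0)
toy = [pd.Series(rng.normal(size=20)) for _ in range(4)]

toy_chain = SimpleChain([
    ("moving average", RollingMean(window=3)),
    ("extract features", SummaryFeatures()),
])
matrix = toy_chain.fit(toy).transform(toy)
print(matrix.shape)  # (4, 3)
```

Each step only needs `fit` and `transform`, which is why heterogeneous steps can be combined in a single chain.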