# Description

This example shows the basic usage of the **transformer** project.

## Read data

In [5]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

import numpy as np
import pandas as pd
import sys

sys.path.insert(0, '..')

from XPandas.data_container import *
from XPandas.transformers import TimeSeriesTransformer, TimeSeriesWindowTransformer

The usage example is based on an open-source time series [data set](http://timeseriesclassification.com/Downloads/FordA.zip).

The first thing to do is read the data. We use the `urlopen` function from Python's built-in urllib package to download the data set into memory. We also skip the first `series_offset` lines of the file, which come before the series.

In [6]:
url = "http://timeseriesclassification.com/Downloads/FordA.zip"
series_offset = 505

In [7]:
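# Download the archive into memory and open it without touching the disk.
# A separate name for the response keeps `url` intact if the cell is re-run.
response = urlopen(url)
zipfile = ZipFile(BytesIO(response.read()))
# Read the raw lines (bytes), decode them, and skip everything before the series.
lines = zipfile.open('FordA/FordA.csv').readlines()
lines = [l.decode('utf-8') for l in lines]
lines = lines[series_offset:]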

**lines** is now a list of strings, each holding one time series in comma-separated format.

505 is the offset of the first line that contains a series.

In [8]:
lines = [list(map(float, l.split(','))) for l in lines]

In [9]:
lines[0][:10]

[1.1871,
 0.4096,
 -0.43154,
 -1.231,
 -1.9055,
 -2.3824,
 -2.588,
 -2.5018,
 -2.1353,
 -1.574]
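
The download-and-parse steps above can be tried without network access. The sketch below builds a tiny zip archive in memory and reads it back in the same way. The file name `demo/demo.csv`, its contents, and the two header lines are made up for illustration and do not reflect the real layout of FordA.csv.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small archive in memory (hypothetical file name and contents).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo/demo.csv",
        "header line 1\nheader line 2\n1.5,2.5,3.5\n-1.0,0.0,1.0\n",
    )

header_lines = 2  # plays the role of series_offset

# Read the archive back: raw bytes -> text lines -> lists of floats.
buffer.seek(0)
with ZipFile(buffer) as archive:
    with archive.open("demo/demo.csv") as handle:
        raw_lines = handle.readlines()

text_lines = [raw.decode("utf-8") for raw in raw_lines][header_lines:]
series = [[float(value) for value in line.split(",")] for line in text_lines]
print(series)
```

`float` ignores surrounding whitespace, so the trailing newline on each line does not need to be stripped before parsing.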

Now **lines** is a list of lists of floats. Let's convert each inner list into a more convenient pandas.Series object.

In [10]:
lines = [pd.Series(l) for l in lines]

In [11]:
lines[0][:10]

0    1.18710
1    0.40960
2   -0.43154
3   -1.23100
4   -1.90550
5   -2.38240
6   -2.58800
7   -2.50180
8   -2.13530
9   -1.57400
dtype: float64

## Main usage

Now we have a list of pandas.Series objects. Next, we wrap the list in another series type called XSeries. The list of lists of floats thus becomes an XSeries of pandas.Series objects.

The XSeries has a global index, and each pandas.Series inside it has its own index.

In [12]:
X = XSeries(lines)

In [13]:
X.head()

0    0      1.187100
1      0.409600
2     -0.43154...
1    0      0.094261
1      0.310310
2      0.53060...
2    0     -1.157000
1     -1.592600
2     -1.50960...
3    0      0.356960
1      0.300850
2      0.24314...
4    0      0.307980
1      0.370350
2      0.26015...
dtype: object
data_type: <class 'pandas.core.series.Series'>

The output might seem a bit messy. It prints the XSeries of pandas.Series objects followed by the **data_type** property, which shows the type of the underlying data stored in the XSeries.

**X** is now an XSeries of pd.Series objects, meaning that every element of this XSeries is a pd.Series.

**X** supports the same methods as a basic pd.Series.
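
The idea of an outer index plus one index per element can be illustrated with plain pandas. The sketch below uses an object-dtype pandas.Series as a stand-in for XSeries; it is not the XPandas implementation, and the variable names are chosen only for this example.

```python
import pandas as pd

# Two inner series of different lengths, each with its own index.
first = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
second = pd.Series([10.0, 20.0], index=["a", "b"])

# An object-dtype Series whose elements are themselves Series.
outer = pd.Series([first, second], dtype=object)

print(list(outer.index))      # the outer ("global") index
print(list(outer[1].index))   # the index of the second inner series
print(outer[1]["b"])          # lookup by outer label, then by inner label
```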

## Transformers

A common task in data science, and a motivation for this project, is to extract features from complex objects (for example, series) and then do some fancy machine learning.

Given an XSeries of pandas.Series objects, one would like to extract features from each series or perform some other kind of feature extraction. That's where **Transformers** come in. Let's try an example.

The first simple example of a transformer is *TimeSeriesWindowTransformer*. This transformer calculates a moving average with a given window size.

In [14]:
tr = TimeSeriesWindowTransformer(windows_size=5)
tr.fit(X)
transformed_series = tr.transform(X)

In [15]:
transformed_series.head()

0    4     -0.394268
5     -1.108168
6     -1.70768...
1    4      0.509686
5      0.680500
6      0.80574...
2    4     -1.098344
5     -0.755320
6     -0.21608...
3    4      0.234223
5      0.165730
6      0.09269...
4    4      0.202701
5      0.154336
6      0.14082...
dtype: object
data_type: <class 'pandas.core.series.Series'>

Of course, with windows_size = 5 the first 4 elements have no complete window and would be NaN, which is why the index of each transformed series starts at 4.

In [16]:
transformed_series[0].head(10)

4    -0.394268
5    -1.108168
6    -1.707688
7    -2.121740
8    -2.302600
9    -2.236300
10   -1.942152
11   -1.469980
12   -0.891442
13   -0.287676
dtype: float64
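
The numbers above can be reproduced with plain pandas. This is a minimal sketch that assumes the transformer behaves like a rolling mean with the incomplete windows dropped; it does not show how TimeSeriesWindowTransformer is implemented internally.

```python
import pandas as pd

# First ten values of the first series, copied from the output shown earlier.
values = [1.1871, 0.4096, -0.43154, -1.231, -1.9055,
          -2.3824, -2.588, -2.5018, -2.1353, -1.574]
series = pd.Series(values)

# Mean over a sliding window of 5 points. The first 4 positions have no
# complete window, so they come out as NaN and are dropped here.
moving_average = series.rolling(window=5).mean().dropna()

print(moving_average.head(3))
```

The first three results sit at index 4, 5, and 6 and agree with the first three entries of `transformed_series[0]`.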

Let's try another transformer, probably the most common one. It extracts several quantitative features from each pandas.Series, such as the mean, std, and quantiles. You can pass your own list of features. The result is an **XDataFrame** object.

In [17]:
tr = TimeSeriesTransformer()
tr.fit(X)
transformed_series = tr.transform(X)

In [18]:
type(transformed_series)

XPandas.data_container.data_container.XDataFrame

In [19]:
transformed_series.head()

| | TimeSeriesTransformer_max | TimeSeriesTransformer_mean | TimeSeriesTransformer_median | TimeSeriesTransformer_min | TimeSeriesTransformer_quantile_25 | TimeSeriesTransformer_quantile_75 | TimeSeriesTransformer_quantile_90 | TimeSeriesTransformer_quantile_95 | TimeSeriesTransformer_std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.5263 | 0.001995 | 0.011186 | -2.7875 | -0.73635 | 0.74192 | 1.2534 | 1.5463 | 0.999998 |
| 1 | 2.6291 | 0.001997 | -0.024726 | -2.4357 | -0.67411 | 0.65808 | 1.3478 | 1.6595 | 0.999997 |
| 2 | 2.6072 | -0.001996 | 0.060685 | -3.0132 | -0.67588 | 0.70123 | 1.2591 | 1.5184 | 1.0 |
| 3 | 2.6431 | -0.001997 | -0.022668 | -2.7275 | -0.66265 | 0.56858 | 1.4102 | 1.8094 | 0.999997 |
| 4 | 3.2398 | -0.001995 | -0.048518 | -3.0085 | -0.70775 | 0.64898 | 1.254 | 1.6699 | 1.000001 |
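
To make the idea of per-series feature extraction concrete, here is a minimal plain-pandas sketch that turns each series into one row of summary statistics. The helper `summarize` and the chosen statistics are assumptions made for this example, not the TimeSeriesTransformer implementation.

```python
import numpy as np
import pandas as pd

def summarize(series):
    """Return a few summary statistics of one series as a dict."""
    return {
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max(),
        "quantile_25": series.quantile(0.25),
        "quantile_75": series.quantile(0.75),
    }

# Three synthetic series of 100 points each.
rng = np.random.RandomState(0)
many_series = [pd.Series(rng.normal(size=100)) for _ in range(3)]

# One row of features per input series.
features = pd.DataFrame([summarize(s) for s in many_series])
print(features.shape)
```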


One can also try the TSFresh transformer.

In [21]:
from XPandas.transformers import TsFreshSeriesTransformer

In [22]:
tr = TsFreshSeriesTransformer()
tr.fit(X.head())
transformed_series = tr.transform(X.head())

In [23]:
transformed_series

| | None__abs_energy | None__absolute_sum_of_changes | None__agg_autocorrelation__f_agg_"mean" | None__agg_autocorrelation__f_agg_"median" | None__agg_autocorrelation__f_agg_"var" | None__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"intercept" | None__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"rvalue" | None__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"slope" | None__agg_linear_trend__f_agg_"max"__chunk_len_10__attr_"stderr" | None__agg_linear_trend__f_agg_"max"__chunk_len_50__attr_"intercept" | ... | None__time_reversal_asymmetry_statistic__lag_1 | None__time_reversal_asymmetry_statistic__lag_2 | None__time_reversal_asymmetry_statistic__lag_3 | None__value_count__value_-inf | None__value_count__value_0 | None__value_count__value_1 | None__value_count__value_inf | None__value_count__value_nan | None__variance | None__variance_larger_than_standard_deviation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 500.000126 | 134.51328 | -0.012049 | 0.020975 | 0.129473 | 0.949115 | -0.027843 | -0.001282 | 0.006577 | 1.719095 | ... | -0.007374 | -0.012282 | -0.015967 | 0 | 0 | 1 | 0 | 0 | 0.998 | False |
| 1 | 499.99929 | 114.289925 | 0.003075 | -0.001312 | 0.227381 | 0.810495 | 0.01456 | 0.000742 | 0.007283 | 1.736565 | ... | 0.0004 | -0.00228 | -0.011068 | 0 | 0 | 1 | 0 | 0 | 0.997999 | False |
| 2 | 500.001514 | 164.089622 | -0.013172 | -0.005309 | 0.066725 | 1.065365 | -0.088829 | -0.004117 | 0.006595 | 2.183568 | ... | 0.004354 | 0.003302 | -0.017264 | 0 | 0 | 0 | 0 | 0 | 0.998003 | False |
| 3 | 499.999445 | 103.51004 | -0.005639 | -0.0226 | 0.28182 | 1.141145 | -0.214194 | -0.012728 | 0.008292 | 2.641368 | ... | -0.00246 | 0.000917 | 0.012002 | 0 | 0 | 0 | 0 | 0 | 0.997999 | False |
| 4 | 500.003011 | 154.299542 | 0.001552 | 0.006698 | 0.060942 | 0.842727 | 0.119359 | 0.007096 | 0.008432 | 2.007018 | ... | 0.000949 | 0.005811 | 0.019023 | 0 | 0 | 0 | 0 | 0 | 0.998006 | False |


One may ask: "Well, it's a nice transformer, but can I build a pipeline like the scikit-learn [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)?"

Sure! Let's look at an example. We combine TimeSeriesWindowTransformer and TimeSeriesTransformer into one pipeline. That's where **PipeLineChain** comes to help.

In [25]:
from XPandas.transformers import PipeLineChain

In [26]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer())
])
chain.fit(X)

PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])

In [27]:
chain.get_params

<bound method Pipeline.get_params of PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])>

In [28]:
transformed_X = chain.transform(X)

In [29]:
transformed_X.head()

| | TimeSeriesTransformer_max | TimeSeriesTransformer_mean | TimeSeriesTransformer_median | TimeSeriesTransformer_min | TimeSeriesTransformer_quantile_25 | TimeSeriesTransformer_quantile_75 | TimeSeriesTransformer_quantile_90 | TimeSeriesTransformer_quantile_95 | TimeSeriesTransformer_std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.16144 | 0.002078 | 0.017664 | -2.48536 | -0.633164 | 0.65868 | 1.156081 | 1.403896 | 0.896272 |
| 1 | 2.39636 | -0.002229 | -0.030936 | -2.27418 | -0.649324 | 0.596704 | 1.193813 | 1.580243 | 0.924516 |
| 2 | 2.32512 | 0.005656 | 0.057383 | -2.43876 | -0.567 | 0.615622 | 1.025741 | 1.343563 | 0.855456 |
| 3 | 2.4443 | 0.000632 | -0.006893 | -2.51592 | -0.628024 | 0.544161 | 1.332912 | 1.674656 | 0.938187 |
| 4 | 2.64094 | -0.001295 | -0.036442 | -2.47826 | -0.58778 | 0.594925 | 1.130192 | 1.429327 | 0.860424 |


All right! Let's try to add a scikit-learn transformer to the PipeLineChain. For example, let's run PCA on transformed_X.

In [30]:
from sklearn.decomposition import PCA

In [31]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5))
])
chain.fit(X)

PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False))])

In [32]:
transformed_X = chain.transform(X)

In [33]:
transformed_X.head()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -0.133152 | -0.242552 | 0.097523 | -0.004435 | -0.009747 |
| 1 | -0.125413 | 0.076021 | -0.089267 | 0.010531 | 0.017437 |
| 2 | -0.028607 | -0.088828 | 0.205043 | 0.098009 | 0.032338 |
| 3 | 0.071478 | -0.058813 | -0.247669 | -0.02355 | -0.052968 |
| 4 | 0.200611 | 0.110884 | 0.0642 | 0.012187 | -0.038497 |


Let's do something even more interesting and add a scikit-learn estimator at the end of the PipeLineChain!

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [36]:
Y = np.random.binomial(1, 0.5, X.shape[0])

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)

Make sure that X_train and X_test are both of type XSeries.

In [38]:
print(type(X_train))
print(type(X_test))

<class 'XPandas.data_container.data_container.XSeries'>
<class 'XPandas.data_container.data_container.XSeries'>


In [39]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5)),
    ('logit_regression', LogisticRegression())
])
chain = chain.fit(X_train, y_train)

In [40]:
prediction = chain.predict(X_test)

In [41]:
accuracy_score(y_test, prediction)

0.48822095857026809

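It works! The accuracy is close to 0.5, as expected, because the labels **Y** were drawn at random; the point is that the whole chain fits and predicts.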
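
For comparison, the same pattern (feature extraction, then PCA, then a classifier) can be sketched with scikit-learn's own `Pipeline` on a plain 2D array, where each row is one series. The function `summary_features`, the synthetic data, and the labels are assumptions made for this example; no XPandas classes are involved.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def summary_features(X):
    """Map each row (one series) to [mean, std, min, max]."""
    return np.column_stack(
        [X.mean(axis=1), X.std(axis=1), X.min(axis=1), X.max(axis=1)]
    )

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 50))   # 40 synthetic series of length 50
y = np.tile([0, 1], 20)         # arbitrary labels, both classes present

pipe = Pipeline([
    ("extract features", FunctionTransformer(summary_features)),
    ("pca", PCA(n_components=2)),
    ("logit_regression", LogisticRegression()),
])
pipe.fit(X, y)
prediction = pipe.predict(X)
print(prediction.shape)
```

As in the chain above, every step except the last must provide `transform`, and only the final step needs `predict`.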

One can also create an inline CustomTransformer like this:

In [42]:
from XPandas.transformers import CustomTransformer

In [43]:
my_awesome_transfomer = CustomTransformer(transform_function=lambda x: x.std())

In [44]:
my_awesome_transfomer.fit(X)

CustomTransformer(data_types=None, name='CustomTransformer',
         transform_function=<function <lambda> at 0x115e8c950>)

In [45]:
my_awesome_transfomer.transform(X).head()

0    0.999998
1    0.999997
2    1.000000
3    0.999997
4    1.000001
dtype: float64
data_type: <class 'numpy.float64'>
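
Conceptually, such a transformer applies one function to every element of the outer series. A minimal plain-pandas sketch of that idea follows; the helper `apply_elementwise` is hypothetical and only illustrates the behaviour seen in the output above, not how CustomTransformer is implemented.

```python
import pandas as pd

def apply_elementwise(outer, func):
    """Apply func to every element of outer and keep the outer index."""
    return pd.Series([func(element) for element in outer], index=outer.index)

# An object-dtype Series holding two inner series (a stand-in for XSeries).
outer = pd.Series(
    [pd.Series([1.0, 2.0, 3.0]), pd.Series([2.0, 4.0, 6.0, 8.0])],
    dtype=object,
)

stds = apply_elementwise(outer, lambda s: s.std())
print(stds)
```

Because the function returns one number per inner series, the result is an ordinary float series with the same outer index.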

If you want to create a custom transformer with more complex logic, please take a look at the internal implementation of the transformers.

Now let's take a look at the **XDataFrame** class.

## XDataFrame class

**XDataFrame** is an abstract container based on pandas.DataFrame that can store **XSeries** objects.

The main feature of **XDataFrame** is that its columns are **XSeries** of any **data_type**, holding ANY objects. For example, one may have a data set of series, images, texts, plain numbers, or any custom objects. Imagine a 2D data container that stores all of them. One would then like to write a chain of transformers that turns it into a simple 2D numeric matrix ready to be passed to a scikit-learn predictor.

Let's see an example. First, let's create an **XDataFrame**.

Let **Y** hold the label for each row.

In [46]:
Y = XSeries(Y)

In [47]:
df = XDataFrame({
    'X': X,
    'Y': Y
})

In [48]:
df.head()

| | X | Y |
|---|---|---|
| 0 | 0 1.187100 1 0.409600 2 -0.43154... | 1 |
| 1 | 0 0.094261 1 0.310310 2 0.53060... | 0 |
| 2 | 0 -1.157000 1 -1.592600 2 -1.50960... | 1 |
| 3 | 0 0.356960 1 0.300850 2 0.24314... | 0 |
| 4 | 0 0.307980 1 0.370350 2 0.26015... | 1 |


Add a new column to the XDataFrame.

In [49]:
df['X_1'] = XSeries([
    pd.Series(np.random.normal(size=100))
    for _ in range(X.shape[0])
])

To transform an **XDataFrame**, one has to specify the transformation logic for each column that needs to be transformed. One can do this using **DataFrameTransformer**.

For example, let's apply **TimeSeriesWindowTransformer** to the X column and **TimeSeriesTransformer** to the X_1 column.

In [50]:
from XPandas.transformers import DataFrameTransformer

In [51]:
df_transformer = DataFrameTransformer({
    'X': TimeSeriesWindowTransformer(windows_size=4),
    'X_1': TimeSeriesTransformer()
})

In [52]:
df_transformer.fit(df)

DataFrameTransformer(transformations={'X': TimeSeriesWindowTransformer(windows_size=4), 'X_1': TimeSeriesTransformer(features=None)})

In [53]:
transformed_df = df_transformer.transform(df)

In [54]:
transformed_df.head()

| | X | Y | TimeSeriesTransformer_max | TimeSeriesTransformer_mean | TimeSeriesTransformer_median | TimeSeriesTransformer_min | TimeSeriesTransformer_quantile_25 | TimeSeriesTransformer_quantile_75 | TimeSeriesTransformer_quantile_90 | TimeSeriesTransformer_quantile_95 | TimeSeriesTransformer_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 -0.016460 4 -0.789610 5 -1.48761... | 1 | 1.947551 | -0.01584 | 0.071164 | -2.564157 | -0.730519 | 0.689059 | 1.144133 | 1.407224 | 0.961483 |
| 1 | 3 0.416408 4 0.613542 5 0.77304... | 0 | 2.327146 | 0.098496 | 0.165579 | -2.499361 | -0.507944 | 0.770032 | 1.166795 | 1.797573 | 0.988724 |
| 2 | 3 -1.315175 4 -1.083680 5 -0.54600... | 1 | 3.257576 | 0.137199 | 0.090058 | -2.053543 | -0.560547 | 0.785426 | 1.292849 | 1.550931 | 0.954297 |
| 3 | 3 0.268788 4 0.203539 5 0.13194... | 0 | 2.850513 | -0.048726 | -0.018989 | -2.990853 | -0.718245 | 0.692489 | 1.368423 | 1.762179 | 1.090315 |
| 4 | 3 0.255629 4 0.176381 5 0.10033... | 1 | 2.648693 | -0.076139 | -0.152326 | -2.695419 | -0.750743 | 0.544101 | 1.405485 | 1.706039 | 1.076105 |
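
The per-column idea can also be sketched in plain pandas: a dictionary maps column names to functions, and each function is applied to every element of its column. The helper `transform_columns`, the synthetic data, and the chosen functions are assumptions for illustration only. Unlike DataFrameTransformer, this sketch keeps one output column per input column instead of expanding features into several columns.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "X": pd.Series([pd.Series(rng.normal(size=20)) for _ in range(3)], dtype=object),
    "X_1": pd.Series([pd.Series(rng.normal(size=10)) for _ in range(3)], dtype=object),
    "Y": [1, 0, 1],
})

# Column name -> function applied to every element of that column.
transformations = {
    "X": lambda s: s.rolling(window=4).mean().dropna(),
    "X_1": lambda s: s.std(),
}

def transform_columns(frame, transformations):
    """Transform the listed columns element by element; leave the rest as is."""
    result = frame.copy()
    for column, func in transformations.items():
        result[column] = pd.Series(
            [func(element) for element in frame[column]], index=frame.index
        )
    return result

transformed = transform_columns(df, transformations)
print(transformed.dtypes)
```

Columns without an entry in the dictionary, such as `Y` here, pass through unchanged.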
