# Description

This example shows the basic usage of the **transformer** project.

## Read data

In [5]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

import numpy as np
import pandas as pd
import sys

sys.path.insert(0, '..')

from XPandas.data_container import XSeries, XDataFrame
from XPandas.transformers import TimeSeriesTransformer, TimeSeriesWindowTransformer


The usage example is based on an open source time series [data set](http://timeseriesclassification.com/Downloads/FordA.zip).

The first thing to do is read the data. We use the `urlopen` function from Python's built-in urllib package to download the data set into memory. We also skip the lines that precede the series.

In [6]:
url = "http://timeseriesclassification.com/Downloads/FordA.zip"
series_offset = 505

In [7]:
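# Download the archive into memory and open it without writing to disk
response = urlopen(url)
archive = ZipFile(BytesIO(response.read()))
lines = archive.open('FordA/FordA.csv').readlines()
lines = [l.decode('utf-8') for l in lines]
# Drop everything before the first series
lines = lines[series_offset:]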

**lines** is now a list of strings, each holding one time series in comma-separated format.

505 is the offset of the first series in the file.

In [8]:
lines = [list(map(float, l.split(','))) for l in lines]

In [9]:
lines[0][:10]

[1.1871,
 0.4096,
 -0.43154,
 -1.231,
 -1.9055,
 -2.3824,
 -2.588,
 -2.5018,
 -2.1353,
 -1.574]
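The reading steps can be tried offline. The sketch below builds a tiny zip archive in memory and reads it back with the same `ZipFile(BytesIO(...))` pattern, skips a header offset, and parses the remaining lines into floats. The file name `demo/demo.csv` and its contents are made up for illustration.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small archive in memory; the name and contents are hypothetical.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("demo/demo.csv", "header line\n1.0,2.5,-3.0\n0.5,0.25,4.0\n")

header_offset = 1  # number of leading lines to skip (505 for the FordA file)

# Read it back as if the bytes had come from urlopen(...).read().
with ZipFile(BytesIO(buffer.getvalue())) as archive:
    raw_lines = archive.open("demo/demo.csv").readlines()

# Decode bytes to text, drop the header, and parse each line into floats.
text_lines = [raw.decode("utf-8") for raw in raw_lines][header_offset:]
rows = [[float(value) for value in line.split(",")] for line in text_lines]
print(rows)
```

`float` ignores surrounding whitespace, so the trailing newline on each line does not need to be stripped first.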

Now **lines** is a list of lists of floats. Let's convert each inner list into a more convenient pandas.Series object.

In [10]:
lines = [pd.Series(l) for l in lines]

In [11]:
lines[0][:10]

0    1.18710
1    0.40960
2   -0.43154
3   -1.23100
4   -1.90550
5   -2.38240
6   -2.58800
7   -2.50180
8   -2.13530
9   -1.57400
dtype: float64

## Main usage

Now we have a list of pandas.Series objects. The next step is to wrap this list in another series called XSeries. The list of lists of floats thus becomes an XSeries of pandas.Series objects.

The XSeries has a global index, and each pandas.Series has its own index.

In [12]:
X = XSeries(lines)

In [13]:
X.head()

0    0      1.187100
1      0.409600
2     -0.43154...
1    0      0.094261
1      0.310310
2      0.53060...
2    0     -1.157000
1     -1.592600
2     -1.50960...
3    0      0.356960
1      0.300850
2      0.24314...
4    0      0.307980
1      0.370350
2      0.26015...
dtype: object
data_type: <class 'pandas.core.series.Series'>

The output might seem a bit messy. It prints the XSeries of pandas.Series and the **data_type** property. This property shows the type of the underlying data in the XSeries.

**X** is now an XSeries of pd.Series objects, meaning that every element of this XSeries is a pd.Series.

**X** supports all the methods that a basic pd.Series does.
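The two levels of indexing can be illustrated with plain pandas. The sketch below does not use XSeries at all; it only assumes that an ordinary `pd.Series` of dtype `object` can hold other `pd.Series` as elements, which is the structure the output above displays. The values are made up.

```python
import pandas as pd

# Two inner series of different lengths, each with its own 0-based index.
inner = [pd.Series([1.0, 2.0, 3.0]), pd.Series([10.0, 20.0])]

# The outer series has dtype object; every element is itself a pandas.Series.
outer = pd.Series(inner)

print(outer.dtype)            # dtype of the outer container
print(type(outer.iloc[0]))    # type of one element
print(outer.iloc[1].iloc[0])  # outer position 1, then inner position 0

# Ordinary Series methods work on the outer level, for example map().
lengths = outer.map(len)
print(lengths.tolist())
```

The outer index selects a whole series, and the inner index selects a value within it.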

## Transformers

A common task in data science, and the motivation for this project, is to extract features from complex objects (for example, series) and then apply machine learning to them.

Given an XSeries of pandas.Series, one would like to extract features from each Series. That's where **Transformers** come in. Let's look at an example.

The first simple example of a transformer is *TimeSeriesWindowTransformer*. This transformer calculates a moving average with a given window size.

In [14]:
tr = TimeSeriesWindowTransformer(windows_size=5)
tr.fit(X)
transformed_series = tr.transform(X)

In [15]:
transformed_series.head()

0    4     -0.394268
5     -1.108168
6     -1.70768...
1    4      0.509686
5      0.680500
6      0.80574...
2    4     -1.098344
5     -0.755320
6     -0.21608...
3    4      0.234223
5      0.165730
6      0.09269...
4    4      0.202701
5      0.154336
6      0.14082...
dtype: object
data_type: <class 'pandas.core.series.Series'>

With windows_size = 5 the first 4 positions have no complete window (their moving average would be NaN), so each transformed series starts at index 4.

In [16]:
transformed_series[0].head(10)

4    -0.394268
5    -1.108168
6    -1.707688
7    -2.121740
8    -2.302600
9    -2.236300
10   -1.942152
11   -1.469980
12   -0.891442
13   -0.287676
dtype: float64
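The numbers above can be checked with plain pandas. The sketch below is not necessarily how TimeSeriesWindowTransformer is implemented; it assumes the transformer behaves like `rolling(...).mean()` followed by dropping the incomplete windows, and uses the first ten values of the first series printed earlier.

```python
import pandas as pd

# First ten values of the first FordA series, copied from the output above.
series = pd.Series([1.1871, 0.4096, -0.43154, -1.231, -1.9055,
                    -2.3824, -2.588, -2.5018, -2.1353, -1.574])

window_size = 5

# rolling(...).mean() yields NaN until a full window is available;
# dropna() removes those first window_size - 1 positions.
moving_average = series.rolling(window=window_size).mean().dropna()

print(moving_average.head(3))
```

The first three results sit at indices 4, 5 and 6 and agree with the transformed values shown above.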

Let's try another transformer, probably the most common one. It extracts several quantitative features from each pandas.Series, such as the mean, standard deviation, and quantiles. You can also pass your own list of features. The result is an **XDataFrame** object.

In [17]:
tr = TimeSeriesTransformer()
tr.fit(X)
transformed_series = tr.transform(X)

In [18]:
type(transformed_series)

XPandas.data_container.data_container.XDataFrame

In [19]:
transformed_series.head()

| | TimeSeriesTransformer_max | TimeSeriesTransformer_mean | TimeSeriesTransformer_median | TimeSeriesTransformer_min | TimeSeriesTransformer_quantile_25 | TimeSeriesTransformer_quantile_75 | TimeSeriesTransformer_quantile_90 | TimeSeriesTransformer_quantile_95 | TimeSeriesTransformer_std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.5263 | 0.001995 | 0.011186 | -2.7875 | -0.73635 | 0.74192 | 1.2534 | 1.5463 | 0.999998 |
| 1 | 2.6291 | 0.001997 | -0.024726 | -2.4357 | -0.67411 | 0.65808 | 1.3478 | 1.6595 | 0.999997 |
| 2 | 2.6072 | -0.001996 | 0.060685 | -3.0132 | -0.67588 | 0.70123 | 1.2591 | 1.5184 | 1.0 |
| 3 | 2.6431 | -0.001997 | -0.022668 | -2.7275 | -0.66265 | 0.56858 | 1.4102 | 1.8094 | 0.999997 |
| 4 | 3.2398 | -0.001995 | -0.048518 | -3.0085 | -0.70775 | 0.64898 | 1.254 | 1.6699 | 1.000001 |
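The idea of turning each series into one row of summary statistics can be sketched in plain pandas. This is an illustration only, not the TimeSeriesTransformer implementation: the `describe` helper, the column names, and the synthetic data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Three synthetic series stand in for the FordA data.
series_list = [pd.Series(rng.normal(size=200)) for _ in range(3)]


def describe(series):
    """Return a few summary statistics of one series as a labelled row."""
    return pd.Series({
        "max": series.max(),
        "mean": series.mean(),
        "median": series.median(),
        "min": series.min(),
        "quantile_25": series.quantile(0.25),
        "quantile_75": series.quantile(0.75),
        "std": series.std(),
    })


# One row per input series, one column per statistic.
features = pd.DataFrame([describe(s) for s in series_list])
print(features.shape)
```

Each input series collapses to a fixed-length row, which is what makes the result usable as a feature matrix.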


One can also try the TSFresh transformer.

In [20]:
from XPandas.transformers import TsFreshSeriesTransformer

In [None]:
tr = TsFreshSeriesTransformer()
tr.fit(X.head())
transformed_series = tr.transform(X.head())

In [None]:
transformed_series

One may ask: "Well, it's a nice transformer, but can I build a pipeline like the scikit-learn [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)?"

Sure! Let's look at an example. We combine TimeSeriesTransformer and TimeSeriesWindowTransformer into one pipeline. That's where **PipeLineChain** comes in.

In [None]:
from XPandas.transformers import PipeLineChain

In [None]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer())
])
chain.fit(X)

In [None]:
chain.get_params()

In [None]:
transformed_X = chain.transform(X)

In [None]:
transformed_X.head()
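What chaining means can be shown without any library support. The sketch below is a simplification under stated assumptions: the steps are plain functions with no `fit` stage, and `moving_average`, `summary`, and `run_chain` are hypothetical helpers, not part of PipeLineChain.

```python
import pandas as pd


def moving_average(series, window_size=5):
    """Smooth a series and drop the incomplete leading windows."""
    return series.rolling(window=window_size).mean().dropna()


def summary(series):
    """Reduce a series to two summary statistics."""
    return pd.Series({"mean": series.mean(), "std": series.std()})


# Named steps, in the order they are applied.
steps = [("moving average", moving_average), ("extract features", summary)]


def run_chain(series, steps):
    """Feed the output of each step into the next one."""
    result = series
    for _name, function in steps:
        result = function(result)
    return result


raw = [pd.Series(range(10), dtype=float), pd.Series(range(10, 30), dtype=float)]
table = pd.DataFrame([run_chain(s, steps) for s in raw])
print(table)
```

The output of the smoothing step becomes the input of the feature step, so the final table has one row per original series.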

All right! Let's try to add a scikit-learn transformer to the PipeLineChain. For example, let's apply PCA to transformed_X.

In [None]:
from sklearn.decomposition import PCA

In [None]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5))
])
chain.fit(X)

In [None]:
transformed_X = chain.transform(X)

In [None]:
transformed_X.head()

Let's do even more interesting things! Let's try to add a scikit-learn estimator at the end of PipeLineChain!

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Y holds one label per series; it is created in the "XDataFrame class" section below.
X_train, X_test, y_train, y_test = train_test_split(X, Y)

Make sure that X_train and X_test are still of type XSeries.

In [None]:
print(type(X_train))
print(type(X_test))

In [None]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5)),
    ('logit_regression', LogisticRegression())
])
chain = chain.fit(X_train, y_train)

In [None]:
prediction = chain.predict(X_test)

In [None]:
accuracy_score(y_test, prediction)
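The last two stages of the chain are ordinary scikit-learn components, so their behaviour can be tried in isolation with the standard `sklearn.pipeline.Pipeline`. In the sketch below the feature matrix and labels are synthetic stand-ins (assumptions for illustration), not the FordA features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 9))       # stand-in for nine extracted features
labels = (features[:, 0] > 0).astype(int)  # synthetic labels, illustration only

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, random_state=0
)

# PCA reduces nine features to five components before the classifier sees them.
pipeline = Pipeline([
    ("pca", PCA(n_components=5)),
    ("logit_regression", LogisticRegression()),
])
pipeline.fit(features_train, labels_train)

predicted = pipeline.predict(features_test)
accuracy = accuracy_score(labels_test, predicted)
print(accuracy)
```

Only the final step needs the labels; the preceding transformer is fitted on the features alone.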

It works! 

One can also create an inline CustomTransformer like this:

In [None]:
from XPandas.transformers import CustomTransformer

In [None]:
my_awesome_transformer = CustomTransformer(transform_function=lambda x: x.std())

In [None]:
my_awesome_transformer.fit(X)

In [None]:
my_awesome_transformer.transform(X).head()
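The effect of such a function-based transformer can be approximated in plain pandas. The sketch below assumes the transform function is called once per inner series; that is an assumption about CustomTransformer, not something confirmed here, and the data is made up.

```python
import pandas as pd

# An object-dtype series whose elements are themselves series.
nested = pd.Series([
    pd.Series([1.0, 2.0, 3.0]),
    pd.Series([2.0, 2.0, 2.0, 2.0]),
])

# Apply the same reduction as the lambda above to every inner series.
spread = nested.apply(lambda s: s.std())
print(spread.tolist())
```

Because the function returns a scalar, the result is a flat numeric series with one value per inner series.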

If you want to create a custom transformer with more complex logic, please take a look at the internal implementation of the transformers.

Now let's take a look at **XDataFrame** class.

## XDataFrame class

**XDataFrame** is an abstract container based on pandas.DataFrame that can store **XSeries** objects.

The main feature of **XDataFrame** is that its columns are **XSeries** of any **data_type**, holding arbitrary objects. For example, a data set may contain series, images, texts, plain numbers, or any custom objects. Imagine a 2D data container that stores all of them. One would then like to write a chain of transformers that turns it into a simple 2D numeric matrix ready for a scikit-learn predictor.

Let's look at an example. First, let's create an **XDataFrame**.

Let **Y** be the label for each row.

In [None]:
Y = np.random.binomial(1, 0.5, X.shape[0])
Y = XSeries(Y)

In [None]:
df = XDataFrame({
    'X': X,
    'Y': Y
})

In [None]:
df.head()

Add a new column to the XDataFrame.

In [None]:
df['X_1'] = XSeries([
    pd.Series(np.random.normal(size=100))
    for _ in range(X.shape[0])
])

To transform an **XDataFrame**, one has to specify the transformation logic for each column that needs to be transformed. This can be done with **DataFrameTransformer**.

For example, let's apply **TimeSeriesWindowTransformer** to the `X` column and **TimeSeriesTransformer** to the `X_1` column.

In [None]:
from XPandas.transformers import DataFrameTransformer

In [None]:
df_transformer = DataFrameTransformer({
    'X': TimeSeriesWindowTransformer(windows_size=4),
    'X_1': TimeSeriesTransformer()
})

In [None]:
df_transformer.fit(df)

In [None]:
transformed_df = df_transformer.transform(df)

In [None]:
transformed_df.head()
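The column-to-transformer mapping can be mimicked with a dictionary of functions over a plain `pd.DataFrame`. This is a sketch under assumptions, not the DataFrameTransformer implementation: `column_functions` is a hypothetical name, there is no `fit` stage, and columns without an entry are left unchanged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A frame with one column of nested series and one column of plain labels.
frame = pd.DataFrame({
    "X": pd.Series([pd.Series(rng.normal(size=20)) for _ in range(3)]),
    "Y": [0, 1, 0],
})

# Map each column that should change to the function applied to its elements.
column_functions = {
    "X": lambda s: s.rolling(window=4).mean().dropna(),
}

transformed = frame.copy()
for column, function in column_functions.items():
    # Build the new column element by element so each result stays a series.
    transformed[column] = pd.Series(
        [function(element) for element in frame[column]],
        index=frame.index,
    )

print(len(transformed["X"].iloc[0]))
print(transformed["Y"].tolist())
```

A list comprehension is used instead of `Series.apply` because `apply` expands series-valued results into a DataFrame, which is not wanted here.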