# User Guide

This notebook presents basic usage examples of the XPandas package.

### Example dataset

In [1]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

import numpy as np
import pandas as pd
import os, sys
import requests

sys.path.insert(0, '..')

from XPandas.data_container import *
from XPandas.transformers import TimeSeriesTransformer, TimeSeriesWindowTransformer


The usage example is based on an open-source time series [data set](http://timeseriesclassification.com/Downloads/FordA.zip).

The first step is to read the data. Here, we use the `urlopen` function from Python's built-in `urllib` to download the data set, then skip the first `series_offset` lines of the file.

In [2]:
url = "http://timeseriesclassification.com/Downloads/FordA.zip"
series_offset = 505

In [3]:
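# Use a separate name for the response so that `url` keeps the address
# and the cell can be re-run.
response = urlopen(url)
zipfile = ZipFile(BytesIO(response.read()))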
lines = zipfile.open('FordA/FordA.csv').readlines()
lines = [l.decode('utf-8') for l in lines]
lines = lines[series_offset:]

``lines`` is now a list of strings, each representing a time series in comma-separated format. We can convert them into lists of floats:

In [4]:
lines = [list(map(float, l.split(','))) for l in lines]

In [5]:
lines[0][:10]

[1.1871,
 0.4096,
 -0.43154,
 -1.231,
 -1.9055,
 -2.3824,
 -2.588,
 -2.5018,
 -2.1353,
 -1.574]
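The download step needs network access. The sketch below reproduces the same read-and-parse pattern offline: it builds a small ZIP archive in memory and reads it back with `ZipFile(BytesIO(...))`. The member name `demo/demo.csv`, its contents, and `header_lines` are made up for illustration and are not taken from the FordA archive.

```python
from io import BytesIO
from zipfile import ZipFile

# Hypothetical CSV content: one header line followed by two data rows.
csv_text = "header\n1.0,2.0,3.0\n4.0,5.0,6.0\n"

# Build a ZIP archive in memory (stands in for the downloaded bytes).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("demo/demo.csv", csv_text)
payload = buffer.getvalue()

# Same read pattern as in the cells above.
with ZipFile(BytesIO(payload)) as archive:
    raw_lines = archive.open("demo/demo.csv").readlines()

header_lines = 1  # assumed number of leading lines to skip
decoded = [line.decode("utf-8") for line in raw_lines][header_lines:]

# float() ignores the trailing newline of the last field.
rows = [list(map(float, line.split(","))) for line in decoded]
print(rows)
```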

Let's convert each nested list into a more convenient ``pandas.Series`` object.

In [6]:
lines = [pd.Series(l) for l in lines]

In [7]:
lines[0][:10]

0    1.18710
1    0.40960
2   -0.43154
3   -1.23100
4   -1.90550
5   -2.38240
6   -2.58800
7   -2.50180
8   -2.13530
9   -1.57400
dtype: float64

## XPandas: Data structures

### XSeries

``XSeries`` is a 1d data container that can store objects of any type.

Using the ``pandas.Series`` objects, we can wrap the list ``lines`` in an ``XSeries`` object. The object has a global index over the series and a sub-index within each ``pandas.Series``.

In [8]:
X = XSeries(lines)

In [9]:
X.head()

0    0      1.187100
1      0.409600
2     -0.43154...
1    0      0.094261
1      0.310310
2      0.53060...
2    0     -1.157000
1     -1.592600
2     -1.50960...
3    0      0.356960
1      0.300850
2      0.24314...
4    0      0.307980
1      0.370350
2      0.26015...
dtype: object
data_type: <class 'pandas.core.series.Series'>

The output reveals the ``data_type`` property of the ``XSeries`` object, which holds the type of the contained objects, in this case ``pandas.Series``. The ``XSeries`` is thus built up of ``pandas.Series`` objects. Specifically, ``X`` supports all methods of ``pandas.Series``.
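The nesting of a global index and a sub-index can be pictured with plain pandas. The following is only an analogy, assuming an object-dtype `pandas.Series` whose elements are themselves `pandas.Series`; it does not show how `XSeries` is implemented, and the names `inner_a`, `inner_b`, and `outer` are illustrative.

```python
import pandas as pd

# Two short inner series, each with its own sub-index (0, 1, 2).
inner_a = pd.Series([1.0, 2.0, 3.0])
inner_b = pd.Series([4.0, 5.0, 6.0])

# Outer container: the global index (0, 1) addresses whole series.
outer = pd.Series([inner_a, inner_b])

first = outer.iloc[0]          # global index -> one inner series
value = outer.iloc[1].iloc[2]  # global index, then sub-index -> scalar

print(type(first).__name__, value)
```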

### XDataFrame

``XDataFrame`` is an abstract 2d container that is based on ``pandas.DataFrame`` and stores ``XSeries`` objects.

The main feature of ``XDataFrame`` is that its columns are ``XSeries`` objects that can contain and manage any **data_type**. For example, one may have a data set consisting of series, images, texts, plain numbers, or even custom objects. Ideally, we would want to handle such different data types in a unified 2d data container and, for example, apply a chain of transformers to create a simple 2d matrix of training data.

The following example illustrates such an ``XDataFrame`` workflow.

Let ``Y`` be a vector of labels for each row.

In [10]:
Y = np.random.binomial(1, 0.5, X.shape[0])
Y = XSeries(Y)

In [11]:
df = XDataFrame({
    'X': X,
    'Y': Y
})

In [12]:
df.head()

|   | X | Y |
|---|---|---|
| 0 | 0 1.187100 1 0.409600 2 -0.43154... | 0 |
| 1 | 0 0.094261 1 0.310310 2 0.53060... | 1 |
| 2 | 0 -1.157000 1 -1.592600 2 -1.50960... | 0 |
| 3 | 0 0.356960 1 0.300850 2 0.24314... | 0 |
| 4 | 0 0.307980 1 0.370350 2 0.26015... | 1 |


Add a new column to the ``XDataFrame``:

In [13]:
df['X_1'] = XSeries([
    pd.Series(np.random.normal(size=100))
    for _ in range(X.shape[0])
])

## XPandas: Transformers

A major motivation for this project is the common data science task of extracting features from complex objects (for example, series) before proceeding with machine learning.

Given an ``XSeries`` of ``pandas.Series``, one would, for instance, like to extract features from each series. That's where *Transformers* play a vital role.

Each ``Transformer`` object supports the ``fit`` and ``transform`` methods, just like [scikit-learn transformers](http://scikit-learn.org/stable/data_transforms.html).

Let's explore some examples.

### TimeSeriesWindowTransformer

This transformer calculates a moving average with a given window size.

In [14]:
tr = TimeSeriesWindowTransformer(windows_size=5)
tr.fit(X)
transformed_series = tr.transform(X)

In [15]:
transformed_series.head()

0    4     -0.394268
5     -1.108168
6     -1.70768...
1    4      0.509686
5      0.680500
6      0.80574...
2    4     -1.098344
5     -0.755320
6     -0.21608...
3    4      0.234223
5      0.165730
6      0.09269...
4    4      0.202701
5      0.154336
6      0.14082...
dtype: object
data_type: <class 'pandas.core.series.Series'>

With ``windows_size=5`` the moving average is undefined for the first 4 elements, so each transformed series starts at index 4.

In [16]:
transformed_series[0].head(10)

4    -0.394268
5    -1.108168
6    -1.707688
7    -2.121740
8    -2.302600
9    -2.236300
10   -1.942152
11   -1.469980
12   -0.891442
13   -0.287676
dtype: float64
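The window behaviour can be checked with plain pandas. The sketch below assumes that the transformer's result matches `rolling(window).mean()` with the leading NaN entries dropped; that is an assumption based on the output above, not a statement about the XPandas implementation. The input is the first ten values of the first series shown earlier.

```python
import pandas as pd

# First ten values of the first FordA series, copied from the output above.
values = [1.1871, 0.4096, -0.43154, -1.231, -1.9055,
          -2.3824, -2.588, -2.5018, -2.1353, -1.574]
series = pd.Series(values)

window = 5
# rolling(...).mean() yields NaN until a full window is available;
# dropna() removes those leading entries.
smoothed = series.rolling(window=window).mean().dropna()

print(smoothed.index[0])           # first index with a full window
print(round(smoothed.iloc[0], 6))  # mean of the first five values
```

The first entry has index 4 and value -0.394268, which agrees with the transformer output above.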

### TimeSeriesTransformer

Let's try another transformer, probably the most common one. It extracts several quantitative features from each ``pandas.Series``, such as mean, std, and quantiles. You can also pass your own list of features. As a result we get an ``XDataFrame`` object.

In [17]:
tr = TimeSeriesTransformer()
tr.fit(X)
transformed_series = tr.transform(X)

In [18]:
type(transformed_series)

XPandas.data_container.data_container.XDataFrame

In [19]:
transformed_series.head()

|   | None_TimeSeriesTransformer_max | None_TimeSeriesTransformer_mean | None_TimeSeriesTransformer_median | None_TimeSeriesTransformer_min | None_TimeSeriesTransformer_quantile_25 | None_TimeSeriesTransformer_quantile_75 | None_TimeSeriesTransformer_quantile_90 | None_TimeSeriesTransformer_quantile_95 | None_TimeSeriesTransformer_std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.5263 | 0.001995 | 0.011186 | -2.7875 | -0.73635 | 0.74192 | 1.2534 | 1.5463 | 0.999998 |
| 1 | 2.6291 | 0.001997 | -0.024726 | -2.4357 | -0.67411 | 0.65808 | 1.3478 | 1.6595 | 0.999997 |
| 2 | 2.6072 | -0.001996 | 0.060685 | -3.0132 | -0.67588 | 0.70123 | 1.2591 | 1.5184 | 1.0 |
| 3 | 2.6431 | -0.001997 | -0.022668 | -2.7275 | -0.66265 | 0.56858 | 1.4102 | 1.8094 | 0.999997 |
| 4 | 3.2398 | -0.001995 | -0.048518 | -3.0085 | -0.70775 | 0.64898 | 1.254 | 1.6699 | 1.000001 |
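The idea of turning each series into one row of summary features can be sketched with plain pandas. The helper `summarize` and the random input below are hypothetical and only mimic a subset of the column names above; they are not the implementation of ``TimeSeriesTransformer``.

```python
import numpy as np
import pandas as pd

def summarize(series):
    """Return a dict of summary statistics for one series."""
    return {
        "max": series.max(),
        "mean": series.mean(),
        "median": series.median(),
        "min": series.min(),
        "quantile_25": series.quantile(0.25),
        "quantile_75": series.quantile(0.75),
        "std": series.std(),
    }

# Three made-up series of standard normal noise.
rng = np.random.default_rng(0)
many_series = [pd.Series(rng.normal(size=200)) for _ in range(3)]

# One row per input series, one column per feature.
features = pd.DataFrame([summarize(s) for s in many_series])
print(features.shape)
```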


We can also make use of the tsfresh transformer, ``TsFreshSeriesTransformer``:

In [20]:
from XPandas.transformers import TsFreshSeriesTransformer

In [21]:
tr = TsFreshSeriesTransformer()
tr.fit(X.head())
transformed_series = tr.transform(X.head())

In [22]:
transformed_series

Unnamed: 0,None__abs_energy,None__absolute_sum_of_changes,"None__agg_autocorrelation__f_agg_""mean""","None__agg_autocorrelation__f_agg_""median""","None__agg_autocorrelation__f_agg_""var""","None__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""intercept""","None__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""rvalue""","None__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""slope""","None__agg_linear_trend__f_agg_""max""__chunk_len_10__attr_""stderr""","None__agg_linear_trend__f_agg_""max""__chunk_len_50__attr_""intercept""",...,None__time_reversal_asymmetry_statistic__lag_1,None__time_reversal_asymmetry_statistic__lag_2,None__time_reversal_asymmetry_statistic__lag_3,None__value_count__value_-inf,None__value_count__value_0,None__value_count__value_1,None__value_count__value_inf,None__value_count__value_nan,None__variance,None__variance_larger_than_standard_deviation
0,500.000126,134.51328,-0.012049,0.020975,0.129473,0.949115,-0.027843,-0.001282,0.006577,1.719095,...,-0.007374,-0.012282,-0.015967,0,0,1,0,0,0.998,False
1,499.99929,114.289925,0.003075,-0.001312,0.227381,0.810495,0.01456,0.000742,0.007283,1.736565,...,0.0004,-0.00228,-0.011068,0,0,1,0,0,0.997999,False
2,500.001514,164.089622,-0.013172,-0.005309,0.066725,1.065365,-0.088829,-0.004117,0.006595,2.183568,...,0.004354,0.003302,-0.017264,0,0,0,0,0,0.998003,False
3,499.999445,103.51004,-0.005639,-0.0226,0.28182,1.141145,-0.214194,-0.012728,0.008292,2.641368,...,-0.00246,0.000917,0.012002,0,0,0,0,0,0.997999,False
4,500.003011,154.299542,0.001552,0.006698,0.060942,0.842727,0.119359,0.007096,0.008432,2.007018,...,0.000949,0.005811,0.019023,0,0,0,0,0,0.998006,False


### Custom inline Transformer

One can also create a custom transformer inline with ``XSeriesTransformer``:

In [23]:
from XPandas.transformers import XSeriesTransformer

In [24]:
my_awesome_transfomer = XSeriesTransformer(transform_function=lambda x: x.std())

In [25]:
my_awesome_transfomer.fit(X)

XSeriesTransformer(data_types=None, name='XSeriesTransformer',
          transform_function=<function <lambda> at 0x111c64d90>)

In [26]:
my_awesome_transfomer.transform(X).head()

0    0.999998
1    0.999997
2    1.000000
3    0.999997
4    1.000001
dtype: float64
data_type: <class 'numpy.float64'>

If you want to create a custom transformer with more complex logic, please take a look at the internal implementation of the transformers.
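As a rough picture of the ``fit``/``transform`` convention, here is a minimal stand-alone class that applies a function to every element. The class name `FunctionTransformerSketch` is hypothetical and the code is independent of XPandas; the real base classes may require more than this.

```python
import pandas as pd

class FunctionTransformerSketch:
    """Hypothetical minimal transformer: applies a function to every element."""

    def __init__(self, transform_function):
        self.transform_function = transform_function

    def fit(self, X, y=None):
        # Nothing to learn for a stateless function; return self so calls chain.
        return self

    def transform(self, X):
        # Apply the function to each element and collect the results.
        return pd.Series([self.transform_function(x) for x in X])

# Plain list of series as input data.
data = [pd.Series([1.0, 2.0, 3.0]), pd.Series([2.0, 4.0, 6.0])]
stds = FunctionTransformerSketch(lambda s: s.std()).fit(data).transform(data)
print(stds.tolist())
```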

## XDataFrame transformer

To transform an **XDataFrame**, one has to specify the transformation logic for the columns that should be transformed using an **XDataFrameTransformer**.

The constructor of **XDataFrameTransformer** takes a mapping dictionary of the form {col_name: XSeries transformer}.

For example, let's apply **TimeSeriesWindowTransformer** to the ``X`` column and **TimeSeriesTransformer** to the ``X_1`` column.

When a transformation is applied to a column, *the column is replaced with its transformed version*.

In [27]:
from XPandas.transformers import XDataFrameTransformer

In [28]:
df_transformer = XDataFrameTransformer({
    'X': TimeSeriesWindowTransformer(windows_size=4),
    'X_1': TimeSeriesTransformer()
})

In [29]:
df_transformer.fit(df)

XDataFrameTransformer(transformations={'X': [TimeSeriesWindowTransformer(windows_size=4)], 'X_1': [TimeSeriesTransformer(features=None)]})

In [30]:
transformed_df = df_transformer.transform(df)

In [31]:
transformed_df.head()

|   | X_TimeSeriesWindowTransformer | Y | X_1_TimeSeriesTransformer_max | X_1_TimeSeriesTransformer_mean | X_1_TimeSeriesTransformer_median | X_1_TimeSeriesTransformer_min | X_1_TimeSeriesTransformer_quantile_25 | X_1_TimeSeriesTransformer_quantile_75 | X_1_TimeSeriesTransformer_quantile_90 | X_1_TimeSeriesTransformer_quantile_95 | X_1_TimeSeriesTransformer_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 -0.016460 4 -0.789610 5 -1.48761... | 0 | 3.345381 | -0.096329 | -0.170409 | -2.480605 | -0.813151 | 0.652504 | 1.279456 | 1.544454 | 1.074414 |
| 1 | 3 0.416408 4 0.613542 5 0.77304... | 1 | 2.312068 | -0.046466 | 0.096838 | -2.158758 | -0.861218 | 0.641127 | 1.138109 | 1.522556 | 0.970865 |
| 2 | 3 -1.315175 4 -1.083680 5 -0.54600... | 0 | 1.937398 | 0.037578 | -0.003737 | -2.549828 | -0.527034 | 0.666005 | 1.210568 | 1.596621 | 0.901065 |
| 3 | 3 0.268788 4 0.203539 5 0.13194... | 0 | 2.760552 | -0.026058 | -0.062914 | -2.805194 | -0.663694 | 0.746644 | 1.200826 | 1.352352 | 1.001159 |
| 4 | 3 0.255629 4 0.176381 5 0.10033... | 1 | 2.733081 | 0.10868 | 0.123488 | -2.797021 | -0.565458 | 0.760682 | 1.28172 | 1.567904 | 1.000222 |
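The column mapping can be pictured with a small stand-alone function over a plain `pandas.DataFrame`. The helper `transform_columns` is hypothetical and handles only the simple case where one column is replaced by one column; it does not cover transformers that expand a column into several feature columns, as ``TimeSeriesTransformer`` does above.

```python
import pandas as pd

def transform_columns(frame, transformations):
    """Apply one function per named column; untouched columns pass through."""
    result = frame.copy()
    for column, func in transformations.items():
        # The transformed values replace the original column.
        result[column] = [func(value) for value in frame[column]]
    return result

frame = pd.DataFrame({
    "signal": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # one list per row
    "label": [0, 1],
})

# Replace each list in "signal" by its mean; "label" is left alone.
out = transform_columns(frame, {"signal": lambda xs: sum(xs) / len(xs)})
print(out)
```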


## Pipeline transformer

Well, that's a nice transformer, but can I create [pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) as in scikit-learn?

Sure! Let's look at an example where we combine ``TimeSeriesTransformer`` and ``TimeSeriesWindowTransformer`` into a single pipeline using a ``PipeLineChain``.

First, let's see an example of ``PipeLineChain`` with ``XSeries`` and then with ``XDataFrame``.

In [32]:
from XPandas.transformers import PipeLineChain

In [33]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer())
])
chain.fit(X)

PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])

In [34]:
chain.get_params

<bound method Pipeline.get_params of PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None))])>

In [35]:
transformed_X = chain.transform(X)

In [36]:
transformed_X.head()

|   | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_max | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_mean | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_median | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_min | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_quantile_25 | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_quantile_75 | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_quantile_90 | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_quantile_95 | None_TimeSeriesWindowTransformer_TimeSeriesTransformer_std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.16144 | 0.002078 | 0.017664 | -2.48536 | -0.633164 | 0.65868 | 1.156081 | 1.403896 | 0.896272 |
| 1 | 2.39636 | -0.002229 | -0.030936 | -2.27418 | -0.649324 | 0.596704 | 1.193813 | 1.580243 | 0.924516 |
| 2 | 2.32512 | 0.005656 | 0.057383 | -2.43876 | -0.567 | 0.615622 | 1.025741 | 1.343563 | 0.855456 |
| 3 | 2.4443 | 0.000632 | -0.006893 | -2.51592 | -0.628024 | 0.544161 | 1.332912 | 1.674656 | 0.938187 |
| 4 | 2.64094 | -0.001295 | -0.036442 | -2.47826 | -0.58778 | 0.594925 | 1.130192 | 1.429327 | 0.860424 |
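The chaining principle itself is simple: each step is fitted on the output of the previous step, and ``transform`` passes the data through the steps in order. The sketch below uses hypothetical toy classes (`ChainSketch`, `Scale`, `Square`) on plain Python lists; it illustrates the general scikit-learn-style convention and is not the ``PipeLineChain`` implementation.

```python
class ChainSketch:
    """Hypothetical minimal pipeline: fit and apply named steps in order."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) pairs

    def fit(self, X, y=None):
        data = X
        for _, step in self.steps:
            # Each step is fitted on the output of the previous one.
            data = step.fit(data, y).transform(data)
        return self

    def transform(self, X):
        data = X
        for _, step in self.steps:
            data = step.transform(data)
        return data


class Scale:
    """Toy step: divide every value by the maximum seen during fit."""

    def fit(self, X, y=None):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class Square:
    """Toy stateless step: square every value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x * x for x in X]


sketch = ChainSketch([("scale", Scale()), ("square", Square())])
result = sketch.fit([1.0, 2.0, 4.0]).transform([2.0, 4.0])
print(result)
```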


All right! Let's try adding a scikit-learn transformer to the ``PipeLineChain``. For example, let's do PCA on ``transformed_X``.

In [37]:
from sklearn.decomposition import PCA

In [38]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5))
])
chain.fit(X)

PipeLineChain(steps=[('moving average trans', TimeSeriesWindowTransformer(windows_size=5)), ('extract features', TimeSeriesTransformer(features=None)), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False))])

In [39]:
transformed_X = chain.transform(X)

In [40]:
transformed_X.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -0.133152 | -0.242552 | 0.097523 | -0.004435 | -0.009747 |
| 1 | -0.125413 | 0.076021 | -0.089267 | 0.010531 | 0.017437 |
| 2 | -0.028607 | -0.088828 | 0.205043 | 0.098009 | 0.032338 |
| 3 | 0.071478 | -0.058813 | -0.247669 | -0.02355 | -0.052968 |
| 4 | 0.200611 | 0.110884 | 0.0642 | 0.012187 | -0.038497 |


Let's do something even more interesting: adding a scikit-learn estimator at the end of the ``PipeLineChain``!

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)

Make sure that ``X_train`` and ``X_test`` are of type ``XSeries``.

In [43]:
print(type(X_train))
print(type(X_test))

<class 'XPandas.data_container.data_container.XSeries'>
<class 'XPandas.data_container.data_container.XSeries'>


In [44]:
chain = PipeLineChain([
    ('moving average trans', TimeSeriesWindowTransformer(windows_size=5)),
    ('extract features', TimeSeriesTransformer()),
    ('pca', PCA(n_components=5)),
    ('logit_regression', LogisticRegression())
])
chain = chain.fit(X_train, y_train)

In [45]:
prediction = chain.predict(X_test)

In [46]:
accuracy_score(y_test, prediction)

0.49715678310316813
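The labels ``Y`` were drawn with ``np.random.binomial(1, 0.5, ...)`` independently of the series, so an accuracy close to 0.5 is the expected outcome here rather than a sign of a broken pipeline. A small simulation illustrates this chance baseline; it uses made-up random predictions instead of the pipeline above, and the sample size `n` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Labels and predictions drawn independently of each other.
y_true = rng.binomial(1, 0.5, n)
y_pred = rng.binomial(1, 0.5, n)

# Fraction of matching entries, i.e. the accuracy.
accuracy = float(np.mean(y_true == y_pred))
print(round(accuracy, 3))
```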

Let's now try ``PipeLineChain`` with ``XDataFrameTransformer``.

Imagine a data set with the feature columns gender (0 or 1), age (int), and series (``pandas.Series``), plus a target (0 or 1). Let's create a ``PipeLineChain`` that extracts features from the series, performs ``PCA`` over the whole feature set, and then performs ``LogisticRegression`` classification.

In [47]:
n = 100

df_features = XDataFrame({
    'gender': XSeries(np.random.binomial(1, 0.7, n)),
    'age': XSeries(np.random.poisson(25, n)),
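    'series': XSeries([
        # Draw a separate random series for every row; `[series] * n`
        # would repeat one and the same object n times.
        pd.Series(np.random.normal(size=500))
        for _ in range(n)
    ])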
})

target = XSeries(np.random.binomial(1, 0.45, n))

In [48]:
features_transformer = XDataFrameTransformer({
    'series': TimeSeriesTransformer()
})

In [49]:
pipe_line = PipeLineChain([
    ('extract_from_series', features_transformer),
    ('pca', PCA(n_components=5)),
    ('logit_regression', LogisticRegression())
])

In [50]:
df_features_train, df_features_test, \
        y_train, y_test = train_test_split(df_features, target)

In [51]:
pipe_line.fit(df_features_train, y_train)

PipeLineChain(steps=[('extract_from_series', XDataFrameTransformer(transformations={'series': [TimeSeriesTransformer(features=None)]})), ('pca', PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logit_regression', LogisticRegression(C=1.0, cla...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [52]:
pipe_line.predict(df_features_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0])