==================
DataFrame tutorial
==================

This is a short tutorial with examples for the ``dataframe`` library.

---------------------------
Creating a DataFrame object
---------------------------

To use data frames, first import the ``DataFrame`` class. For demonstration purposes we also load some datasets:

In [10]:
from dataframe import DataFrame
from dataframe import GroupedDataFrame
from sklearn import datasets
import re
iris_data = datasets.load_iris()

This loads the iris dataset from ``sklearn``, which was introduced by Ronald Fisher in 1936. From the iris dataset, we take the feature names and the measured values for each feature and put them into a dictionary.

In [4]:
features = [re.sub(r"\s|cm|\(|\)", "", x) for x in iris_data.feature_names]
print(features)

['sepallength', 'sepalwidth', 'petallength', 'petalwidth']
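
The regular expression can also be tried without ``sklearn``. The following is a minimal sketch that hard-codes the four iris feature names as ``sklearn`` reports them (``raw_names`` and ``pattern`` are names introduced here for illustration) and strips whitespace, the unit ``cm`` and the parentheses:

```python
import re

# Feature names as returned by sklearn's iris loader (copied here by hand).
raw_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# A raw string keeps the backslashes in the pattern from being
# interpreted as string escapes.
pattern = re.compile(r"\s|cm|\(|\)")
features = [pattern.sub("", name) for name in raw_names]
print(features)
```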


In [5]:
data = {name: iris_data.data[:, i] for i, name in enumerate(features)}
data["target"] = iris_data.target
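
To illustrate the column-wise layout built here, the sketch below constructs the same kind of dictionary from a few hand-written rows instead of the ``sklearn`` array. The names ``rows`` and ``columns`` are made up for this example; the three rows are the first three iris observations.

```python
features = ["sepallength", "sepalwidth", "petallength", "petalwidth"]

# Three observations, one row per flower (row-major layout).
rows = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
]

# zip(*rows) transposes the rows into columns, so each feature name
# is paired with the list of its values.
columns = {name: list(col) for name, col in zip(features, zip(*rows))}
columns["target"] = [0, 0, 0]

print(columns["petallength"])
```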

We can take the dictionary to create a ``DataFrame`` object out of it using:

In [7]:
frame = DataFrame(**data)

Notice that we use the ``**`` unpacking syntax to pass the entries of the ``dict`` to the constructor as keyword arguments. Alternatively, you can call the constructor with explicit keyword arguments:

In [8]:
frame_expl = DataFrame(sepallength=iris_data.data[:,0],
                       sepalwidth=iris_data.data[:,1],
                       petallength=iris_data.data[:,2],
                       petalwidth=iris_data.data[:,3],
                       target=iris_data.target)

The results are the same; the second approach is just more verbose, since we have to enter the arguments manually.
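
The equivalence of the two calls is a general Python feature and not specific to ``DataFrame``. A minimal sketch, using a hypothetical function ``column_names`` as a stand-in for the constructor:

```python
def column_names(**kwargs):
    """Hypothetical stand-in for a constructor taking keyword arguments."""
    # kwargs is an ordinary dict mapping argument names to values.
    return list(kwargs)

data = {"sepallength": [5.1, 4.9], "target": [0, 0]}

# Unpacking the dict with ** ...
from_dict = column_names(**data)
# ... is the same as spelling out every keyword argument by hand.
explicit = column_names(sepallength=[5.1, 4.9], target=[0, 0])

print(from_dict == explicit)
```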

In [9]:
print("Frame kwargs:")
print(str(frame) + "\n")
print("Frame verbose:")
print(frame_expl)

Frame kwargs:
+-------------+------------+-------------+------------+--------+
| petallength | petalwidth | sepallength | sepalwidth | target |
+-------------+------------+-------------+------------+--------+
|     1.4     |    0.2     |     5.1     |    3.5     |   0    |
|     1.4     |    0.2     |     4.9     |    3.0     |   0    |
|     1.3     |    0.2     |     4.7     |    3.2     |   0    |
|     1.5     |    0.2     |     4.6     |    3.1     |   0    |
|     1.4     |    0.2     |     5.0     |    3.6     |   0    |
|      .      |     .      |      .      |     .      |   .    |
|      .      |     .      |      .      |     .      |   .    |
|      .      |     .      |      .      |     .      |   .    |
+-------------+------------+-------------+------------+--------+
Frame verbose:
+-------------+------------+-------------+------------+--------+
| petallength | petalwidth | sepallength | sepalwidth | target |
+-------------+------------+-------------+------------+------

Note that upon instantiation the column names are sorted alphabetically. 
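
How the library sorts internally is not shown here, but the resulting order can be reproduced with plain Python. In this sketch ``data`` is a small hand-written dictionary, not the iris frame:

```python
data = {"sepallength": [5.1], "sepalwidth": [3.5],
        "petallength": [1.4], "petalwidth": [0.2], "target": [0]}

# A dict remembers insertion order ...
print(list(data))
# ... while sorted() yields the alphabetical order seen in the table header.
ordered = sorted(data)
print(ordered)
```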

-------------------------
Using the DataFrame class
-------------------------

``DataFrame`` has four main features, which are explained one at a time.

Subsetting DataFrame columns
----------------------------

``subset`` selects columns from the original ``DataFrame`` and returns them as a new ``DataFrame`` object:

In [14]:
targets = frame.subset("target")
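
One way to picture subsetting is as copying the selected columns into a new object. The following sketch shows the idea on a plain dictionary of columns; ``subset_columns`` is a hypothetical helper written for this example and is not part of the ``dataframe`` API:

```python
def subset_columns(columns, *names):
    """Return a new dict containing only the requested columns."""
    missing = [n for n in names if n not in columns]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    # Copy the value lists so the result is independent of the input.
    return {n: list(columns[n]) for n in names}

columns = {"petallength": [1.4, 1.4, 1.3], "target": [0, 0, 0]}
targets = subset_columns(columns, "target")

print(targets)
```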

Aggregating DataFrame columns
-----------------------------

``aggregate`` takes one or multiple columns and applies an aggregation function to them. The aggregated values are returned as a new ``DataFrame`` object. **Make sure** that your aggregation function returns a **scalar**, e.g. a ``float``. First we need to write a class that extends ``Callable`` and overrides ``__call__``. For example, an implementation of the ``mean`` could look like this:

In [17]:
from dataframe import Callable
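from functools import reduce

class Mean(Callable):
    def __call__(self, *args):
        # args[0] holds the first column passed to aggregate
        vals = args[0].values()
        # sum all values and divide by their count, yielding a scalar
        return reduce(lambda x, y: x + y, vals) / len(vals)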

Now you can aggregate the frame like this:

In [None]:
agg = frame.aggregate(Mean, "mean", "petallength")
print(agg["mean"][0])
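
The scalar requirement can be checked independently of the library. Below is a sketch of the same mean computation as a plain callable class (``PlainMean`` is a name invented here and does not extend ``Callable``), compared against ``statistics.fmean`` from the standard library:

```python
from functools import reduce
from statistics import fmean

class PlainMean:
    """Callable that reduces a list of numbers to a single float."""
    def __call__(self, vals):
        return reduce(lambda x, y: x + y, vals) / len(vals)

# First five petal lengths of the iris data, for brevity.
petallength = [1.4, 1.4, 1.3, 1.5, 1.4]
mean = PlainMean()(petallength)

# The result is one scalar, not a list.
print(isinstance(mean, float), abs(mean - fmean(petallength)) < 1e-12)
```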

Modifying DataFrame columns
---------------------------

Similar to ``aggregate``, we can ``modify`` one or several columns. To do that, again write a class extending ``Callable``. **Beware** that, unlike aggregation, modification requires the callable to return a list of the **same length** as the original column. For example, our columns have the following length:

In [19]:
print(len(frame["target"].values()))

150

So if we call ``modify`` on a column in our ``frame``, the result has to be of length ``150``.
As an example, let's standardize the column ``petallength``.

In [None]:
import scipy.stats as sps

class Zscore(Callable):
    def __call__(self, *args):
        vals = args[0].values()
        return sps.zscore(vals).tolist()
    
sta = frame.modify(Zscore, "zscore", "petallength")
print(sta["zscore"][0:5])
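
For readers without ``scipy``, standardization can be sketched with the standard library alone. The function ``zscore`` below is written for this example; it uses the population standard deviation, which corresponds to the default ``ddof=0`` of ``scipy.stats.zscore``. Note that the output has as many entries as the input, as ``modify`` requires:

```python
from statistics import fmean, pstdev

def zscore(vals):
    """Standardize values to zero mean and unit (population) variance."""
    mu = fmean(vals)
    sigma = pstdev(vals)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in vals]

# First five petal lengths only, for brevity; the full column has 150 values.
petallength = [1.4, 1.4, 1.3, 1.5, 1.4]
standardized = zscore(petallength)

print(len(standardized) == len(petallength))
```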