Example of data transforms as categorical arrows ([`R` version](https://github.com/WinVector/rquery/blob/master/Examples/Arrow/Arrow.md) [`Python` version](https://github.com/WinVector/data_algebra/blob/master/Examples/Arrow/Arrow.md)).

(For ideas on applying category theory to science and data, please see David I Spivak, *Category Theory for the Sciences*, MIT Press, 2014.)

The [Python `data_algebra` package](https://github.com/WinVector/data_algebra) supplies a number of operators for working with tabular data.  The operators are picked in reference to [Codd's relational algebra](https://en.wikipedia.org/wiki/Relational_algebra), though (as with [`SQL`](https://en.wikipedia.org/wiki/SQL)) we do not insist on table rows being unique. Many of the operations are simple: selecting rows, selecting columns, joining tables.  Two of the operations stand out: projecting or aggregating rows, and extending tables with new derived columns.

An interesting point: although the `data_algebra` operators are fairly generic, the operator pipelines that map a single table to a single table form the arrows of a category over a nice set of objects.

The objects of this category can be either of:

 * Sets of column names.
 * Maps of column names to column types (schema-like objects).
 
I will take a liberty and call these objects (with or without types) "single table schemas."
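As a rough illustration of the two kinds of objects, the sketch below models a "single table schema" in plain Python, once as a set of column names and once as a map from names to types. The variable and function names are hypothetical and are not part of `data_algebra`.

```python
# Toy representation of the two kinds of objects described above.
# These names are illustrative only; they are not data_algebra classes.

# Object kind 1: a set of column names (order does not matter).
names_only = frozenset(["g", "x", "v", "i"])

# Object kind 2: a map from column names to column types.
typed = {"g": str, "x": int, "v": float, "i": bool}


def forget_types(schema):
    """Map a typed schema to the untyped one by keeping only the names."""
    return frozenset(schema.keys())


# Two column orderings describe the same untyped object.
same_object = frozenset(["i", "v", "x", "g"]) == names_only
print(same_object)
print(forget_types(typed) == names_only)
```

Treating the untyped object as a set rather than a list is a design choice in this sketch: it makes column order irrelevant when deciding whether two schemas are the same object.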

Our setup is easiest to explain with an example.  Let's work an example in `Python`.

First we import our packages and instantiate an example data frame.

In [1]:
import pandas

from data_algebra.data_ops import *
from data_algebra.arrow import *

d = pandas.DataFrame({
    'g': ['a', 'b', 'b', 'c', 'c', 'c'],
    'x': [1, 4, 5, 7, 8, 9],
    'v': [10.0, 40.0, 50.0, 70.0, 80.0, 90.0],
    'i': [True, True, False, False, False, False],
})

d

Unnamed: 0,g,x,v,i
0,a,1,10.0,True
1,b,4,40.0,True
2,b,5,50.0,False
3,c,7,70.0,False
4,c,8,80.0,False
5,c,9,90.0,False


`data_algebra` operator pipelines are designed to transform data.  For example, we can define the following operator pipeline, which is designed to count how many different values there are for `g` and to assign a unique integer id to each group.

In [2]:
table_description = describe_table(d, table_name='d')

id_ops_a = table_description. \
    project(group_by=['g']). \
    extend({
        'ngroup': '_row_number()',
    },
    order_by=['g'])

The pipeline is saved in the variable `id_ops_a` which can then be applied to our data as follows.

In [3]:
id_ops_a.transform(d)

Unnamed: 0,g,ngroup
0,a,1
1,b,2
2,c,3


The pipelines are designed for composition in addition to application to data.  For example we can use the `id_ops_a` pipeline as part of a larger pipeline as follows.

In [4]:
id_ops_b = table_description. \
    natural_join(id_ops_a, by=['g'], jointype='LEFT')

This pipeline specifies joining the integer group ids back into the original table as follows.

In [5]:
id_ops_b.transform(d)

Unnamed: 0,g,x,v,i,ngroup
0,a,1,10.0,True,1
1,b,4,40.0,True,2
2,b,5,50.0,False,2
3,c,7,70.0,False,3
4,c,8,80.0,False,3
5,c,9,90.0,False,3


Notice the `ngroup` column is a function of the `g` column in this result.
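One way to make "is a function of" concrete is to check that every value of `g` is paired with exactly one value of `ngroup`. The helper below is a minimal plain-Python sketch of that check. `is_function_of` is a hypothetical name, and the rows are a hand-copied subset of the columns shown above.

```python
def is_function_of(rows, key, value):
    """Return True if each `key` value maps to exactly one `value`."""
    seen = {}
    for row in rows:
        k, v = row[key], row[value]
        # setdefault records the first value seen for k, then returns it
        if seen.setdefault(k, v) != v:
            return False
    return True


rows = [
    {"g": "a", "x": 1, "ngroup": 1},
    {"g": "b", "x": 4, "ngroup": 2},
    {"g": "b", "x": 5, "ngroup": 2},
    {"g": "c", "x": 7, "ngroup": 3},
    {"g": "c", "x": 8, "ngroup": 3},
    {"g": "c", "x": 9, "ngroup": 3},
]

print(is_function_of(rows, "g", "ngroup"))  # ngroup is determined by g
print(is_function_of(rows, "g", "x"))       # x is not determined by g
```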

I am now ready to state my big point.  These pipelines have documented pre and post conditions: what set of columns (and optionally types) they expect on their input, and what set of columns (optionally types) the pipeline produces.

In [6]:
# needs
id_ops_b.columns_used()

{'d': {'g', 'i', 'v', 'x'}}

In [7]:
# produced
id_ops_b.column_names

('g', 'x', 'v', 'i', 'ngroup')
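To show how such pre and post conditions could be derived mechanically, here is a small plain-Python sketch. It assumes each step declares the columns it reads and the columns it adds. That `steps` structure and the `conditions` function are hypothetical and do not reflect how `data_algebra` stores its operators.

```python
# Each hypothetical step declares the columns it reads and the columns it adds.
steps = [
    {"uses": {"g"}, "adds": ["ngroup"]},
    {"uses": {"x", "g"}, "adds": ["row_number"]},
]


def conditions(input_columns, steps):
    """Return (needed, produced) for running `steps` on `input_columns`."""
    available = list(input_columns)
    for step in steps:
        missing = step["uses"] - set(available)
        if missing:
            raise ValueError(f"missing required columns: {sorted(missing)}")
        # new columns are appended after the existing ones
        available += [c for c in step["adds"] if c not in available]
    return set(input_columns), tuple(available)


needed, produced = conditions(["g", "x", "v", "i"], steps)
print(sorted(needed))
print(produced)
```

The pre condition here is simply the declared input schema, and the post condition is that schema plus every column added along the way.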

This is where we seem to have a nice opportunity to use category theory to manage our pre and post conditions.  Let's wrap this pipeline into a convenience class to make the categorical connection easier to see.

In [8]:
a1 = DataOpArrow(id_ops_b)

`a1` is a category theory arrow: it has the usual domain (arrow base, or incoming object) and co-domain (arrow head, or outgoing object) in a category of single-table schemas.

In [9]:
a1.dom()

DataOpArrow(
 (TableDescription(table_name="data_frame", column_names=["g", "x", "v", "i"]))
,
 free_table_key='data_frame')

In [10]:
a1.cod()

DataOpArrow(
 (TableDescription(table_name="data_frame", column_names=["g", "i", "ngroup", "v", "x"]))
,
 free_table_key='data_frame')

The domain and co-domain are what the succinct presentation of the arrow shows.

In [11]:
print(a1)

[
 'd':
  [ g, x, v, i ]
   ->
  [ g, i, ngroup, v, x ]
]



The arrow has a more detailed presentation, which is the realization of the operator pipeline as code.

In [12]:
print(a1.__repr__())

DataOpArrow(
 (
    TableDescription(table_name="d", column_names=["g", "x", "v", "i"]).natural_join(
        b=TableDescription(table_name="d", column_names=["g", "x", "v", "i"])
        .project({}, group_by=["g"])
        .extend({"ngroup": "_row_number()"}, partition_by=1, order_by=["g"]),
        on=["g"],
        jointype="LEFT",
    )
)
,
 free_table_key='d')


We can think of our arrows (or obvious mappings of them) as applicable to:
  * More arrows of the same type (composition).
  * Data (action or application).
  * Single table schemas (managing pre and post conditions).
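The three kinds of application listed above can be sketched with a toy arrow class in plain Python. This is only an illustration under simplifying assumptions: schemas are sets of column names, data is a list of row dictionaries, and `ToyArrow` with its method names is hypothetical and is not `data_algebra`'s `DataOpArrow`.

```python
class ToyArrow:
    """Illustrative stand-in for an arrow; not data_algebra's DataOpArrow."""

    def __init__(self, dom, cod, fn):
        self.dom = frozenset(dom)   # incoming column names
        self.cod = frozenset(cod)   # outgoing column names
        self.fn = fn                # list of row dicts -> list of row dicts

    def compose(self, other):
        """Apply to another arrow: self first, then other."""
        if self.cod != other.dom:
            raise ValueError("co-domain does not match domain")
        return ToyArrow(self.dom, other.cod,
                        lambda rows: other.fn(self.fn(rows)))

    def apply_to_data(self, rows):
        """Apply to data, checking each row against the domain."""
        for row in rows:
            if frozenset(row) != self.dom:
                raise ValueError("row columns do not match domain")
        return self.fn(rows)

    def apply_to_schema(self, columns):
        """Apply to a schema: domain in, co-domain out."""
        if frozenset(columns) != self.dom:
            raise ValueError("schema does not match domain")
        return self.cod


add_y = ToyArrow({"x"}, {"x", "y"},
                 lambda rows: [dict(r, y=r["x"] + 1) for r in rows])
add_z = ToyArrow({"x", "y"}, {"x", "y", "z"},
                 lambda rows: [dict(r, z=r["x"] * r["y"]) for r in rows])

both = add_y.compose(add_z)
result = both.apply_to_data([{"x": 2}])
schema = both.apply_to_schema(["x"])
print(result)
print(sorted(schema))
```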
  


Arrows can be composed or applied by using the notation `a1.transform(d)` or the equivalent notation `d >> a1`. Note: we are not thinking of `>>` itself as an arrow, but as a symbol for composition of arrows (we used `>>` as it is one of the few operators not used by `Pandas`, which means using this operator makes it easier for our notation to work with `Pandas`).

In [13]:
a1.transform(d)

Unnamed: 0,g,x,v,i,ngroup
0,a,1,10.0,True,1
1,b,4,40.0,True,2
2,b,5,50.0,False,2
3,c,7,70.0,False,3
4,c,8,80.0,False,3
5,c,9,90.0,False,3


In [14]:
d >> a1

Unnamed: 0,g,x,v,i,ngroup
0,a,1,10.0,True,1
1,b,4,40.0,True,2
2,b,5,50.0,False,2
3,c,7,70.0,False,3
4,c,8,80.0,False,3
5,c,9,90.0,False,3


Up until now we have been showing how we work to obey the category theory axioms.  From here on we look at what category theory does for us.  What it does is check correct composition and ensure full associativity of operations.

As is typical in category theory, there can be more than one arrow from a given object to a given object.  For example, the following is a different arrow with the same start and end.

In [15]:
a1b = DataOpArrow(
    table_description. \
        extend({
            'ngroup': 0
        }))

print(a1b)

[
 'd':
  [ g, x, v, i ]
   ->
  [ g, i, ngroup, v, x ]
]



However, the `a1b` arrow represents a different operation than `a1`:

In [16]:
a1b.transform(d)

Unnamed: 0,g,x,v,i,ngroup
0,a,1,10.0,True,0
1,b,4,40.0,True,0
2,b,5,50.0,False,0
3,c,7,70.0,False,0
4,c,8,80.0,False,0
5,c,9,90.0,False,0


The arrows can be composed exactly when the pre-conditions meet the post conditions.  

Here are two examples of violating the pre and post conditions.  The point is, the categorical conditions enforce the checking for us.  We can't compose arrows whose domain and co-domain don't match.  Up until now we have been setting things up to make the categorical machinery work; now this machinery will work for us and make the job of managing complex data transformations easier.

In [17]:
cols2_too_small = [c for c in (set(id_ops_b.column_names) - set(['i']))]
ordered_ops = TableDescription(table_name='d2', column_names=cols2_too_small). \
    extend({
        'row_number': '_row_number()',
        'shift_v': 'v.shift()',
    },
    order_by=['x'],
    partition_by=['g'])
a2 = DataOpArrow(ordered_ops)
print(a2)

[
 'd2':
  [ g, ngroup, x, v ]
   ->
  [ g, ngroup, row_number, shift_v, v, x ]
]



In [18]:
try:
    a1 >> a2
except ValueError as e:
    print(str(e))

extra incoming columns: {'i'}


In [19]:
cols2_too_large = list(id_ops_b.column_names) + ['q']
ordered_ops = TableDescription(table_name='d2', column_names=cols2_too_large). \
    extend({
        'row_number': '_row_number()',
        'shift_v': 'v.shift()',
    },
    order_by=['x'],
    partition_by=['g'])
a2 = DataOpArrow(ordered_ops)
print(a2)

[
 'd2':
  [ g, x, v, i, ngroup, q ]
   ->
  [ g, i, ngroup, q, row_number, shift_v, v, x ]
]



In [20]:
try:
    a1 >> a2
except ValueError as e:
    print(str(e))


missing required columns: {'q'}


The point is: we will never see the above exceptions when we compose arrows that match on pre and post conditions (which in category theory are the only arrows you are allowed to compose).
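The two failure modes above (an incoming table with an extra column, and an incoming table lacking a required column) can be imitated with a few lines of plain Python. This is a sketch of the idea only. `check_composable` and `try_compose` are hypothetical names, and the messages merely echo the wording of the errors shown above rather than reproducing the library's exact output.

```python
def check_composable(incoming, required):
    """Raise ValueError unless the two column sets are identical."""
    incoming, required = set(incoming), set(required)
    extra = incoming - required
    missing = required - incoming
    if extra:
        raise ValueError(f"extra incoming columns: {sorted(extra)}")
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")


# Columns produced by the first arrow in the running example.
a1_cod = ["g", "x", "v", "i", "ngroup"]


def try_compose(required):
    """Return 'ok' or the error message for composing onto `required`."""
    try:
        check_composable(a1_cod, required)
        return "ok"
    except ValueError as e:
        return str(e)


too_small = try_compose(["g", "x", "v", "ngroup"])   # second arrow lacks 'i'
too_large = try_compose(a1_cod + ["q"])              # second arrow wants 'q'
exact = try_compose(["ngroup", "i", "v", "x", "g"])  # same set, other order

print(too_small)
print(too_large)
print(exact)
```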

When the pre and post conditions are met the arrows compose in a fully associative manner.

In [21]:
ordered_ops = TableDescription(table_name='d2', column_names=id_ops_b.column_names). \
    extend({
        'row_number': '_row_number()',
        'shift_v': 'v.shift()',
    },
    order_by=['x'],
    partition_by=['g'])
a2 = DataOpArrow(ordered_ops)
print(a2)

[
 'd2':
  [ g, x, v, i, ngroup ]
   ->
  [ g, i, ngroup, row_number, shift_v, v, x ]
]



In [22]:
print(a1 >> a2)

[
 'd':
  [ g, x, v, i ]
   ->
  [ g, i, ngroup, row_number, shift_v, v, x ]
]



We can also enforce type invariants.

In [23]:
wrong_example = pandas.DataFrame({
    'g': ['a'],
    'v': [1.0],
    'x': ['b'],
    'i': [True],
    'ngroup': [1]
})

print(a2)

[
 'd2':
  [ g, x, v, i, ngroup ]
   ->
  [ g, i, ngroup, row_number, shift_v, v, x ]
]



In [24]:
try:
    a1 >> a2
except Exception as ex:
    print(str(ex))
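As a rough illustration of what a type invariant check could look like, the plain-Python sketch below compares the value types in a single row against a typed schema. It does not use `data_algebra`'s own checking. `expected_types` and `type_errors` are hypothetical names, and the expected types are an assumption based on the example data.

```python
# Assumed typed schema for the incoming table of the second arrow.
expected_types = {"g": str, "x": int, "v": float, "i": bool, "ngroup": int}


def type_errors(row, expected):
    """List columns whose value type differs from the expected type."""
    # `type(...) is not typ` avoids accepting bool where int is expected,
    # since bool is a subclass of int in Python.
    return sorted(
        name for name, typ in expected.items()
        if type(row[name]) is not typ
    )


good_row = {"g": "a", "v": 1.0, "x": 1, "i": True, "ngroup": 1}
wrong_row = {"g": "a", "v": 1.0, "x": "b", "i": True, "ngroup": 1}

print(type_errors(good_row, expected_types))
print(type_errors(wrong_row, expected_types))
```

The second row mirrors `wrong_example`: the column names match, but `x` holds a string where a number is expected.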

We can add yet another set of operations to our pipeline: computing a per-group mean of the variable `v`.

In [25]:
unordered_ops = TableDescription(table_name='d3', column_names=ordered_ops.column_names). \
    extend({
        'mean_v': 'v.mean()',
    },
    partition_by=['g'])

a3 = DataOpArrow(unordered_ops)

print(a3)

[
 'd3':
  [ g, x, v, i, ngroup, row_number, shift_v ]
   ->
  [ g, i, mean_v, ngroup, row_number, shift_v, v, x ]
]



The three arrows can form a composite pipeline that computes a number of interesting per-group statistics all at once.

In [26]:
print(a1 >> a2 >> a3)

[
 'd':
  [ g, x, v, i ]
   ->
  [ g, i, mean_v, ngroup, row_number, shift_v, v, x ]
]



And the methods are fully associative (they can be grouped in any way that preserves the original order).

In [27]:
print((a1 >> a2) >> a3)

[
 'd':
  [ g, x, v, i ]
   ->
  [ g, i, mean_v, ngroup, row_number, shift_v, v, x ]
]



In [28]:
print(a1 >> (a2 >> a3))

[
 'd':
  [ g, x, v, i ]
   ->
  [ g, i, mean_v, ngroup, row_number, shift_v, v, x ]
]



All the compositions are in fact the same arrow, as we can see by using it on data.

In [29]:
((a1 >> a2) >> a3).transform(d)

Unnamed: 0,g,x,v,i,ngroup,row_number,shift_v,mean_v
0,a,1,10.0,True,1,1,,10.0
1,b,4,40.0,True,2,1,,45.0
2,b,5,50.0,False,2,2,40.0,45.0
3,c,7,70.0,False,3,1,,80.0
4,c,8,80.0,False,3,2,70.0,80.0
5,c,9,90.0,False,3,3,80.0,80.0


In [30]:
(a1 >> (a2 >> a3)).transform(d)

Unnamed: 0,g,x,v,i,ngroup,row_number,shift_v,mean_v
0,a,1,10.0,True,1,1,,10.0
1,b,4,40.0,True,2,1,,45.0
2,b,5,50.0,False,2,2,40.0,45.0
3,c,7,70.0,False,3,1,,80.0
4,c,8,80.0,False,3,2,70.0,80.0
5,c,9,90.0,False,3,3,80.0,80.0


The combination operator `>>` is fully associative over the combination of data and arrows.
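The claim that `>>` associates across both arrows and data can be checked in a toy model. The sketch below is plain Python under simplifying assumptions (schemas as sets of names, data as a list of row dictionaries). The `Arrow` class here is hypothetical and only imitates the notation; it is not the `DataOpArrow` implementation.

```python
class Arrow:
    """Toy arrow over column-name sets; a sketch, not DataOpArrow."""

    def __init__(self, dom, cod, fn):
        self.dom, self.cod, self.fn = frozenset(dom), frozenset(cod), fn

    def __rshift__(self, other):
        # arrow >> arrow: composition, allowed only when schemas meet
        if self.cod != other.dom:
            raise ValueError("schemas do not match")
        return Arrow(self.dom, other.cod,
                     lambda rows: other.fn(self.fn(rows)))

    def __rrshift__(self, rows):
        # data >> arrow: application to a list of row dicts
        return self.fn(rows)


f = Arrow({"x"}, {"x", "y"},
          lambda rows: [dict(r, y=r["x"] + 1) for r in rows])
g = Arrow({"x", "y"}, {"x", "y", "z"},
          lambda rows: [dict(r, z=r["y"] * 2) for r in rows])
h = Arrow({"x", "y", "z"}, {"x", "y", "z", "w"},
          lambda rows: [dict(r, w=r["z"] - r["x"]) for r in rows])

d = [{"x": 1}, {"x": 5}]

left = d >> ((f >> g) >> h)       # compose left-grouped, then apply
right = d >> (f >> (g >> h))      # compose right-grouped, then apply
stepwise = ((d >> f) >> g) >> h   # apply one arrow at a time

print(left == right == stepwise)
```

Associativity holds here for the same reason it holds for ordinary function composition: every grouping runs the same functions in the same order.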

The underlying `data_algebra` steps compute and check very similar pre and post conditions; the arrow class just makes this look more explicitly like arrows moving through objects in a category.

The data arrows operate over three different value domains:

 * single table schemas (transforming single table schemas)
 * their own arrow space (i.e. composition)
 * data frames (transforming data as an action)

We can also demonstrate identity arrows.

In [31]:
id = DataOpArrow(describe_table(d))
print(id)

[
 'data_frame':
  [ g, x, v, i ]
   ->
  [ g, i, v, x ]
]



In [32]:
id >> id

DataOpArrow(
 (TableDescription(table_name="data_frame", column_names=["g", "x", "v", "i"]))
,
 free_table_key='data_frame')

In [33]:
id.transform(d)
 

Unnamed: 0,g,x,v,i
0,a,1,10.0,True
1,b,4,40.0,True
2,b,5,50.0,False
3,c,7,70.0,False
4,c,8,80.0,False
5,c,9,90.0,False
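The identity laws (composing with an identity arrow on either side changes nothing) can also be checked in a small plain-Python model. In this sketch an arrow is just a `(dom, cod, fn)` triple. The helper names `compose` and `identity` are hypothetical and are not `data_algebra` functions.

```python
def compose(a, b):
    """Compose arrows given as (dom, cod, fn) triples: a first, then b."""
    if a[1] != b[0]:
        raise ValueError("schemas do not match")
    return (a[0], b[1], lambda rows: b[2](a[2](rows)))


def identity(schema):
    """Identity arrow on a schema: same columns in and out, rows unchanged."""
    schema = frozenset(schema)
    return (schema, schema, lambda rows: rows)


# A non-trivial arrow from {x} to {x, y}.
f = (frozenset({"x"}), frozenset({"x", "y"}),
     lambda rows: [dict(r, y=r["x"] * 10) for r in rows])

rows = [{"x": 1}, {"x": 2}]

left_unit = compose(identity(f[0]), f)    # identity on the domain, then f
right_unit = compose(f, identity(f[1]))   # f, then identity on the co-domain

print(left_unit[2](rows) == f[2](rows) == right_unit[2](rows))
```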



An introduction to the `data_algebra` package can be found [here](https://github.com/WinVector/data_algebra/blob/master/Examples/Introduction/data_algebra_Introduction.md).

Some notes on the category theory influence on the design of the `data_algebra` package can be found [here](https://github.com/WinVector/data_algebra/blob/master/Examples/Arrow/CDesign.md).

An example of treating a 2-argument data operation (such as a join) as an arrow can be found [here](https://github.com/WinVector/rquery/blob/master/Examples/Arrow/JoinArrow.md).

