# Ordered Grouping Example

## Introduction

I'd like to share an example of data-wrangling/data-reshaping and how to solve it in [`Python`](https://www.python.org)/[`Pandas`](https://pandas.pydata.org) using [`data_algebra`](https://github.com/WinVector/data_algebra) (the `R` version of this example can be found [`here`](https://github.com/WinVector/cdata/blob/master/Examples/OrderedGrouping/OrderedGrouping.md)).

In an RStudio Community note, user <code>hklovs</code> asked [how to re-organize some data](https://community.rstudio.com/t/tidying-data-reorganizing-tibble/48292).

The solution is:

  * Get a good definition of what is wanted.
  * Re-process the data so any advisory column you wished you had is actually there.
  * Finish the problem.
  
## The problem

In this example the ask was equivalent to:

<blockquote>
How do I transform data from this format:

| ID | OP | DATE                |
| -: | :- | :------------------ |
|  1 | A  | 2001-01-02 00:00:00 |
|  1 | B  | 2015-04-25 00:00:00 |
|  2 | A  | 2000-04-01 00:00:00 |
|  3 | D  | 2014-04-07 00:00:00 |
|  4 | C  | 2012-12-01 00:00:00 |
|  4 | A  | 2005-06-16 00:00:00 |
|  4 | D  | 2009-01-20 00:00:00 |
|  4 | B  | 2009-01-20 00:00:00 |
|  5 | A  | 2010-10-10 00:00:00 |
|  5 | B  | 2003-11-09 00:00:00 |
|  6 | B  | 2004-01-09 00:00:00 |

Into this format:

| ID | DATE1               | OP1 | DATE2               | OP2         | DATE3               | OP3 |
| -: | :------------------ | :-- | :------------------ | :---------- | :------------------ | :-- |
|  1 | 2001-01-02 00:00:00 | A   | 2015-04-25 00:00:00 | B           | NA                  | NA  |
|  2 | 2000-04-01 00:00:00 | A   | NA                  | NA          | NA                  | NA  |
|  3 | 2014-04-07 00:00:00 | D   | NA                  | NA          | NA                  | NA  |
|  4 | 2005-06-16 00:00:00 | A   | 2009-01-20 00:00:00 | c(“B”, “D”) | 2012-12-01 00:00:00 | C   |
|  5 | 2003-11-09 00:00:00 | B   | 2010-10-10 00:00:00 | A           | NA                  | NA  |
|  6 | 2004-01-09 00:00:00 | B   | NA                  | NA          | NA                  | NA  |
</blockquote>

## The solution

What the ask translates to is: per `ID` pick the first three operations
ordered by date, merging operations with the same timestamp. Then write
these results into a single row for each `ID`.

The first step isn’t to worry about the data format: that is an
inessential or solvable difficulty. Instead, make any extra descriptions
or controls you need explicit. In this case we need ranks, so let’s
add those first.


In [1]:
# bring in all of our modules/packages
import io
import re
import sqlite3

import pandas

import data_algebra.util
from data_algebra.cdata import *
from data_algebra.data_ops import *
import data_algebra.SQLite

In [2]:
# some example data
d = pandas.DataFrame({
    'ID': [1, 1, 2, 3, 4, 4, 4, 4, 5, 5, 6],
    'OP': ['A', 'B', 'A', 'D', 'C', 'A', 'D', 'B', 'A', 'B', 'B'],
    'DATE': ['2001-01-02 00:00:00', '2015-04-25 00:00:00', '2000-04-01 00:00:00', 
             '2014-04-07 00:00:00', '2012-12-01 00:00:00', '2005-06-16 00:00:00', 
             '2009-01-20 00:00:00', '2009-01-20 00:00:00', '2010-10-10 00:00:00', 
             '2003-11-09 00:00:00', '2004-01-09 00:00:00'],
    })

d

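|    | ID | OP | DATE                |
| -: | -: | :- | :------------------ |
|  0 |  1 | A  | 2001-01-02 00:00:00 |
|  1 |  1 | B  | 2015-04-25 00:00:00 |
|  2 |  2 | A  | 2000-04-01 00:00:00 |
|  3 |  3 | D  | 2014-04-07 00:00:00 |
|  4 |  4 | C  | 2012-12-01 00:00:00 |
|  5 |  4 | A  | 2005-06-16 00:00:00 |
|  6 |  4 | D  | 2009-01-20 00:00:00 |
|  7 |  4 | B  | 2009-01-20 00:00:00 |
|  8 |  5 | A  | 2010-10-10 00:00:00 |
|  9 |  5 | B  | 2003-11-09 00:00:00 |
| 10 |  6 | B  | 2004-01-09 00:00:00 |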


In [3]:
# define a user aggregation function
def sorted_concat(vals):
    return ', '.join(sorted([str(vi) for vi in set(vals)]))

# specify the first few data processing steps
ops = describe_table(d, table_name='d'). \
        project({'OP': user_fn(sorted_concat, 'OP')},
                group_by=['ID', 'DATE']). \
        extend({'rank': '_row_number()'},
               partition_by=['ID'],
               order_by=['DATE'])

# apply the processing steps to the example data
d2 = ops.transform(d)

d2

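|    | ID | DATE                | OP   | rank |
| -: | -: | :------------------ | :--- | ---: |
|  0 |  1 | 2001-01-02 00:00:00 | A    |    1 |
|  1 |  1 | 2015-04-25 00:00:00 | B    |    2 |
|  2 |  2 | 2000-04-01 00:00:00 | A    |    1 |
|  3 |  3 | 2014-04-07 00:00:00 | D    |    1 |
|  4 |  4 | 2005-06-16 00:00:00 | A    |    1 |
|  5 |  4 | 2009-01-20 00:00:00 | B, D |    2 |
|  6 |  4 | 2012-12-01 00:00:00 | C    |    3 |
|  7 |  5 | 2003-11-09 00:00:00 | B    |    1 |
|  8 |  5 | 2010-10-10 00:00:00 | A    |    2 |
|  9 |  6 | 2004-01-09 00:00:00 | B    |    1 |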


In the above code we used the `sorted_concat()` aggregation to merge rows that share an `ID` and `DATE` into a single row, combining their operations into one string such as `"B, D"` (the Python analogue of the `c("B", "D")` vector in the original ask).  Then we added a rank column.  This gets us much closer to a complete solution.  All we have to do now is re-arrange the data.
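For readers who want to see the same two steps without `data_algebra`, here is a minimal sketch in plain `Pandas`. It is an illustration of the grouping and ranking logic only, not how `data_algebra` implements it; the variable name `merged` is ours, and it relies on the `DATE` strings sorting correctly as text (which holds for this ISO-style format).

```python
import pandas

# the same example data as above
d = pandas.DataFrame({
    'ID': [1, 1, 2, 3, 4, 4, 4, 4, 5, 5, 6],
    'OP': ['A', 'B', 'A', 'D', 'C', 'A', 'D', 'B', 'A', 'B', 'B'],
    'DATE': ['2001-01-02 00:00:00', '2015-04-25 00:00:00', '2000-04-01 00:00:00',
             '2014-04-07 00:00:00', '2012-12-01 00:00:00', '2005-06-16 00:00:00',
             '2009-01-20 00:00:00', '2009-01-20 00:00:00', '2010-10-10 00:00:00',
             '2003-11-09 00:00:00', '2004-01-09 00:00:00'],
})


def sorted_concat(vals):
    return ', '.join(sorted([str(vi) for vi in set(vals)]))


# step 1: merge operations that share an ID and DATE into one row
merged = (
    d.groupby(['ID', 'DATE'], as_index=False)
     .agg({'OP': sorted_concat})
     .sort_values(['ID', 'DATE'])
     .reset_index(drop=True)
)

# step 2: number the rows within each ID in date order, starting at 1
merged['rank'] = merged.groupby('ID').cumcount() + 1

print(merged)
```

The `cumcount()` call plays the role of the per-`ID` row number here, which is why the frame is sorted by `ID` and `DATE` first.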

For data re-arrangement we strongly encourage drawing out what one wants in terms of one input record and one output record.  With `data_algebra` doing so essentially solves the problem.

So let's look at what happens only to the rows with `ID == 1`.  In this case we expect input rows that look like this:

| ID | DATE                | OP | rank |
| -: | :------------------ | :- | ---: |
|  1 | 2001-01-02 00:00:00 | A  |    1 |
|  1 | 2015-04-25 00:00:00 | B  |    2 |

And we want this record transformed into this:

| ID | DATE1               | OP1 | DATE2               | OP2 | DATE3 | OP3 |
| -: | :------------------ | :-- | :------------------ | :-- | :---- | :-- |
|  1 | 2001-01-02 00:00:00 | A   | 2015-04-25 00:00:00 | B   | NA    | NA  |

The `data_algebra` data shaping rule is: draw a picture of any non-trivial
(more than one row) data records in their full generality. In our case
the interesting record is the following (with the record `ID` columns suppressed for conciseness).


In [4]:
def diagram_to_pandas(s):
    s = s.strip()
    s = re.sub(r'"', '', s)
    return pandas.read_table(sep='\\s*,\\s*', engine='python', filepath_or_buffer=io.StringIO(s))


diagram = diagram_to_pandas("""


    "rank",    "DATE",    "OP"
    "1",       DATE1,     OP1
    "2",       DATE2,     OP2
    "3",       DATE3,     OP3


""")

diagram

|    | rank | DATE  | OP  |
| -: | ---: | :---- | :-- |
|  0 |    1 | DATE1 | OP1 |
|  1 |    2 | DATE2 | OP2 |
|  2 |    3 | DATE3 | OP3 |


The column names `rank`, `DATE`, and `OP` are all column names of the table we are starting with.  The values `1`, `2`, and `3` are all values we expect to see in the `rank` column of the working data frame.  And the symbols `DATE1`, `DATE2`, `DATE3`, `OP1`, `OP2`, and `OP3` are all stand-in names for values we see in our data.  These symbols will be the column names of our new row-records.
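To make the reading of the diagram concrete, the following sketch applies it by hand to the `ID == 1` record. This is only our illustration of what the diagram means, not the `data_algebra` implementation; the names `block`, `control`, and `row` are hypothetical, and the control table is typed in directly rather than parsed.

```python
import pandas

# the block record for ID == 1 (ID column suppressed, as in the diagram)
block = pandas.DataFrame({
    'rank': [1, 2],
    'DATE': ['2001-01-02 00:00:00', '2015-04-25 00:00:00'],
    'OP': ['A', 'B'],
})

# the diagram: 'rank' is the key column, every other cell names an output column
control = pandas.DataFrame({
    'rank': [1, 2, 3],
    'DATE': ['DATE1', 'DATE2', 'DATE3'],
    'OP': ['OP1', 'OP2', 'OP3'],
})

row = {}
for _, ctrl in control.iterrows():
    # find the block row (if any) carrying this rank value
    match = block[block['rank'] == ctrl['rank']]
    for col in ['DATE', 'OP']:
        # the diagram cell supplies the new column name,
        # the block cell in the same position supplies the value
        row[ctrl[col]] = match[col].iloc[0] if len(match) > 0 else None

print(row)
```

Rank `3` has no matching block row for this `ID`, so `DATE3` and `OP3` come out missing, which matches the `NA` cells in the desired output.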

We have tutorials on how to build these diagrams [here](https://winvector.github.io/cdata/articles/design.html) and [here](https://winvector.github.io/cdata/articles/blocksrecs.html).  Essentially we draw one record of the input and output and match column names to stand-in interior values of the other.  The output record is a single row, so we don't have to explicitly pass it in.  However it looks like the following.

In [5]:
row_record = diagram_to_pandas("""

  "DATE1", "OP1", "DATE2", "OP2", "DATE3", "OP3"
   DATE1 ,  OP1 ,  DATE2 ,  OP2 ,  DATE3 ,  OP3

""")

row_record

|    | DATE1 | OP1 | DATE2 | OP2 | DATE3 | OP3 |
| -: | :---- | :-- | :---- | :-- | :---- | :-- |
|  0 | DATE1 | OP1 | DATE2 | OP2 | DATE3 | OP3 |


Notice the interior-data portions (the parts we wrote in the inputs as unquoted) of each table input are the cells that are matched from one record to the other.  These are in fact just the earlier sample inputs and outputs with the values replaced with the placeholders `DATE1`, `DATE2`, `DATE3`, `OP1`, `OP2`, and `OP3`.

With this diagram in hand we can specify the data reshaping step.

In [6]:
record_map = RecordMap(
    blocks_in=RecordSpecification(
        control_table=diagram,
        record_keys=['ID']
    ))

The transform specifies that records are found in the format shown in `diagram`, and are to be converted to rows.  We can confirm the intent by printing the transform.

In [None]:
print(str(record_map))
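As an independent cross-check on what the reshaped result should look like, here is a sketch of the same block-to-row step written with the plain `Pandas` `pivot()` method. It is not the `data_algebra` transform; the ranked table is typed in from the earlier output so the block runs on its own, and the name `wide` is ours.

```python
import pandas

# the ranked table from the earlier step, entered by hand
d2 = pandas.DataFrame({
    'ID': [1, 1, 2, 3, 4, 4, 4, 5, 5, 6],
    'DATE': ['2001-01-02 00:00:00', '2015-04-25 00:00:00', '2000-04-01 00:00:00',
             '2014-04-07 00:00:00', '2005-06-16 00:00:00', '2009-01-20 00:00:00',
             '2012-12-01 00:00:00', '2003-11-09 00:00:00', '2010-10-10 00:00:00',
             '2004-01-09 00:00:00'],
    'OP': ['A', 'B', 'A', 'D', 'A', 'B, D', 'C', 'B', 'A', 'B'],
    'rank': [1, 2, 1, 1, 1, 2, 3, 1, 2, 1],
})

# one row per ID, one column per (value column, rank) pair
wide = d2.pivot(index='ID', columns='rank', values=['DATE', 'OP'])

# flatten the two-level column index into names such as DATE1, OP1
wide.columns = [f'{name}{rank}' for name, rank in wide.columns]

# interleave the columns to match the requested layout
wide = wide[['DATE1', 'OP1', 'DATE2', 'OP2', 'DATE3', 'OP3']].reset_index()

print(wide)
```

Cells with no matching rank are left as missing values, which corresponds to the `NA` entries in the requested format.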