I would like to show how to reshape data using the [data algebra](https://pypi.org/project/data-algebra/)'s [cdata data reshaping tool](https://github.com/WinVector/data_algebra/blob/main/Examples/cdata/cdata.ipynb).

Let's set up our Python worksheet and start with an example.

In [1]:
# import our modules.
import pandas as pd
from data_algebra import RecordMap, RecordSpecification
from data_algebra.test_util import equivalent_frames



For a [recent plotting task](https://github.com/WinVector/Examples/blob/main/calling_R_from_Python/plot_from_R_example.ipynb) it was convenient to generate data in the following format.

In [2]:
# Example data (values changed for legibility) from:
#  https://github.com/WinVector/Examples/blob/main/calling_R_from_Python/sig_pow.ipynb
d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
    })

d_row_form

Unnamed: 0,x,control,treatment,control_tail,treatment_tail
0,-0.14,4.96,1.069,2,19
1,-0.2,5.21,1.16196,3,11


However, for the [actual plotting](https://github.com/WinVector/Examples/blob/main/calling_R_from_Python/sig_pow.ipynb) it was more convenient to have the data in a different format. We take a single row of our example data below.

In [3]:
# pick one row out as a simpler example
example_row = d_row_form.iloc[[0], :]

example_row

Unnamed: 0,x,control,treatment,control_tail,treatment_tail
0,-0.14,4.96,1.069,2,19


For plotting it would be more convenient to have the above example data be in the following format.

In [4]:
# specify, after some hard work, how we wish the example row
# was structured
d_want = pd.DataFrame({
    'x': -0.14,
    'group': ['treatment', 'control'],
    'y': [1.069, 4.96],
    'tail': [19, 2],
})

d_want

Unnamed: 0,x,group,y,tail
0,-0.14,treatment,1.069,19
1,-0.14,control,4.96,2


The job of the data scientist is to work out what formats data is available in, and to derive formats that make tasks easier. It took some thought to come up with the plotting-optimized format, and now we want to realize it.

We have data reshaping or melding tools that can finish the task. What we do is convert `d_want` into a data record specification by:

  * Restricting to the "record data content portion" of the data. We treat `x` as a record key (not content) and exclude the `x` column.
  * Replacing the example values with value names.

The record specification can be built up as follows. This is using ideas from [the theory of coordinatized data](https://win-vector.com/tag/coordinatized-data/).

In [5]:
# convert the example result into a data specification
d_specification = pd.DataFrame({
    "group": d_want["group"]
})
# replace values with names
for new_col, suffix in [('y', ''), ('tail', '_tail')]:
    d_specification[new_col] = [k + suffix for k in d_specification['group']]

d_specification

Unnamed: 0,group,y,tail
0,treatment,treatment,treatment_tail
1,control,control,control_tail


Now we take this symbolic data frame and turn it into a complete data record specification by using the `RecordSpecification` class. We specify:

  * What the data record values block looks like.
  * What key columns tell us which record we are working with (`record_keys = ['x']`).
  * What key columns tell us which row is which within a record (`control_table_keys = ['group']`).

In [6]:
# upgrade the data specification into a record specification
rs = RecordSpecification(
    d_specification,
    record_keys=['x'],
    control_table_keys=['group'],
    )

rs

Unnamed: 0_level_0,record structure,value,value
Unnamed: 0_level_1,group,y,tail
0,treatment,treatment,treatment_tail
1,control,control,control_tail


Notice the `RecordSpecification` class organizes all of the above concerns together.

From our `RecordSpecification` we can then implement our desired record transform. Let's implement the data transform and print it out to confirm it claims to do what we want.

In [7]:
# ask the record specification to design a map from 
# rows to records of our specified form
map_from_rows = rs.map_from_rows()

map_from_rows

Unnamed: 0_level_0,record id,value,value,value,value
Unnamed: 0_level_1,x,treatment,control,treatment_tail,control_tail
0,x record key,treatment value,control value,treatment_tail value,control_tail value

Unnamed: 0_level_0,record id,record structure,value,value
Unnamed: 0_level_1,x,group,y,tail
0,x record key,control,control value,control_tail value
1,x record key,treatment,treatment value,treatment_tail value


Of course, we don't know this really works until we try it. Let's see the transform in action.


In [8]:
# apply the mapping to our original example data
d_records = map_from_rows(d_row_form)

d_records

Unnamed: 0,x,group,y,tail
0,-0.2,control,5.21,3
1,-0.2,treatment,1.16196,11
2,-0.14,control,4.96,2
3,-0.14,treatment,1.069,19


And we now have all of our data transformed into the format we specified.
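As a cross-check on what the mapping did, here is a minimal pandas-only sketch that builds the same record layout by hand. It is an illustration under the assumption that the groups are exactly `control` and `treatment`, not a description of how `data_algebra` implements the map. The name `d_by_hand` is introduced here for illustration only.

```python
import pandas as pd

# same example data as above
d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
})

# build one block of rows per group, renaming that group's
# columns to the shared names y and tail
blocks = []
for group in ['control', 'treatment']:
    block = d_row_form[['x', group, group + '_tail']].rename(
        columns={group: 'y', group + '_tail': 'tail'})
    block.insert(1, 'group', group)  # label which group the block came from
    blocks.append(block)

# stack the blocks and sort so the rows of each record are adjacent
d_by_hand = (
    pd.concat(blocks, ignore_index=True)
    .sort_values(['x', 'group'], ignore_index=True)
)

print(d_by_hand)
```

The hand-written version hard-codes the column naming rule in the loop. The record specification instead carries that rule as data, which is what lets the same object also produce the inverse transform.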

We can even invert the transform to pull data back in the other direction (note: row order and column order are considered inessential in this formulation).

In [9]:
inv_map = map_from_rows.inverse()
d_recovered = inv_map(d_records)
assert equivalent_frames(d_recovered, d_row_form)

d_recovered

Unnamed: 0,x,control,control_tail,treatment,treatment_tail
0,-0.2,5.21,3,1.16196,11
1,-0.14,4.96,2,1.069,19
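To make "row order and column order are inessential" concrete, here is a small pandas-only sketch of an order-insensitive comparison. The helper `same_up_to_order` is hypothetical and written for this note. It is not the implementation of `equivalent_frames`, which may differ (for example in how it handles floating point tolerance or types).

```python
import pandas as pd


def same_up_to_order(a, b):
    """Compare two frames, ignoring row order and column order.

    Hypothetical helper for illustration; uses exact equality.
    """
    if sorted(a.columns) != sorted(b.columns):
        return False
    cols = sorted(a.columns)
    # put columns in a canonical order, then sort rows on all columns
    a_sorted = a[cols].sort_values(cols, ignore_index=True)
    b_sorted = b[cols].sort_values(cols, ignore_index=True)
    return a_sorted.equals(b_sorted)


d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
})

# same content, with columns permuted and rows reversed
d_shuffled = d_row_form[
    ['x', 'control', 'control_tail', 'treatment', 'treatment_tail']
].iloc[::-1].reset_index(drop=True)

print(same_up_to_order(d_row_form, d_shuffled))
```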


From a practical point of view, we are done.

From a theoretical point of view: the `cdata` `RecordSpecification` supplies four major services.

  * `.map_from_rows()`
  * `.map_to_rows()`
  * `.map_from_keyed_column()`
  * `.map_to_keyed_column()`

`.map_from_rows()` and `.map_to_rows()` map a general record structure from and to rows, and are the core of the `cdata` data "pivoting" system. The idea is that data is in records, and sometimes those records span multiple rows. These are the fundamental operations.

`.map_from_keyed_column()` and `.map_to_keyed_column()` map between a general record structure and what are essentially [RDF Triples](https://en.wikipedia.org/wiki/Semantic_triple). This has a number of direct applications. It also directly supports concepts such as `melt()` and `cast()`.
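For readers who know the `melt()`/`cast()` vocabulary, the following pandas-only sketch shows what a keyed-column (triple-like) form of our example data looks like, using `DataFrame.melt()` and `DataFrame.pivot()`. This is an analogy, not a call into `data_algebra`. The column names `measure` and `value` are chosen here for illustration.

```python
import pandas as pd

d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
})

# row records -> keyed column: one row per (x, measure) pair,
# so every column except 'value' acts as a key
d_keyed = d_row_form.melt(
    id_vars=['x'], var_name='measure', value_name='value')

# keyed column -> row records
d_back = (
    d_keyed.pivot(index='x', columns='measure', values='value')
    .reset_index()
)
d_back.columns.name = None  # drop the leftover 'measure' axis label

print(d_keyed)
print(d_back)
```

Note that all values share a single `value` column in the keyed form, so the integer tail counts come back as floats after the round trip.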

About 90% of data reshaping tasks are actually simple maps between "row records" (records where all data is in a single row) and "keyed columns" (or triples, where all but one column are keys). Our example above was a bit more general. For fully general transforms one directly instantiates a `RecordMap` class, as it allows an arbitrary fixed record structure for both the input and the output.


I strongly feel that specifying data transforms as example records to example records is the correct formulation for data reshaping or pivoting. I also feel that transforms between arbitrary records and rows (not keyed columns!) are fundamental. One of these is a join and the other is an aggregation, which grounds the method in the usual [relational operators](https://en.wikipedia.org/wiki/Relational_algebra).
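The join and aggregation reading can be sketched in plain pandas. This is an illustration of the idea under my own assumptions, not necessarily how `data_algebra` implements the maps. The helper `pick` and the names `control_table`, `joined`, `wide`, and `d_rows` are introduced here for illustration.

```python
import pandas as pd

d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
})

# the symbolic record description: cells hold source column names
control_table = pd.DataFrame({
    'group': ['treatment', 'control'],
    'y': ['treatment', 'control'],
    'tail': ['treatment_tail', 'control_tail'],
})


def pick(frame, name_col):
    """For each row, return the value of the column named in frame[name_col]."""
    return [row[row[name_col]] for _, row in frame.iterrows()]


# rows -> records as a join: pair every data row with every
# control table row, then look up the named source columns
joined = d_row_form.merge(control_table, how='cross')
d_records = pd.DataFrame({
    'x': joined['x'],
    'group': joined['group'],
    'y': pick(joined, 'y'),
    'tail': pick(joined, 'tail'),
})

# records -> rows as an aggregation: spread each value into its
# target column (missing elsewhere), then collapse per record key
wide = d_records.copy()
for _, spec in control_table.iterrows():
    in_group = wide['group'] == spec['group']
    wide[spec['y']] = wide['y'].where(in_group)
    wide[spec['tail']] = wide['tail'].where(in_group)
d_rows = (
    wide.drop(columns=['group', 'y', 'tail'])
    .groupby('x', as_index=False)
    .max()  # max skips the missing entries, leaving the one real value
)

print(d_records)
print(d_rows)
```

The aggregation step relies on each target cell receiving exactly one non-missing value per record, which is what the record keys plus control table keys guarantee when they uniquely identify rows.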