
## Introduction

In many data science projects we have the data, but it "is in the wrong format." Luckily re-formatting or reshaping data is a solved problem with many different tools.

For this note, I would like to show how to reshape data using the [data algebra](https://pypi.org/project/data-algebra/)'s [cdata data reshaping tool](https://github.com/WinVector/data_algebra/blob/main/Examples/cdata/cdata.ipynb).

## Example

Let's set up our Python worksheet and start with an example.

In [1]:
# import our modules.
import pandas as pd
from data_algebra import RecordMap, RecordSpecification
from data_algebra.test_util import equivalent_frames
import data_algebra.cdata
from IPython.display import display, HTML

For a [recent plotting task](https://github.com/WinVector/Examples/blob/main/calling_R_from_Python/plot_from_R_example.ipynb) it was convenient to generate data in the following format.

In [2]:
# Example data (values changed for legibility) from:
#  https://github.com/WinVector/Examples/blob/main/calling_R_from_Python/sig_pow.ipynb
d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
    })

d_row_form

|   | x | control | treatment | control_tail | treatment_tail |
|---|---|---|---|---|---|
| 0 | -0.14 | 4.96 | 1.069 | 2 | 19 |
| 1 | -0.2 | 5.21 | 1.16196 | 3 | 11 |


The above format is "row records" or records where all values are in a single row (after our examples we will re-discuss this terminology in a larger context).

In many projects one isn't producing the data, so one doesn't have direct control of the record format.

For the [actual plotting](https://github.com/WinVector/Examples/blob/main/calling_R_from_Python/sig_pow.ipynb) it was more convenient to have the data in a different format. We will illustrate the difference by focusing on one row of the data.

In [3]:
# pick one row out as a simpler example
example_row = d_row_form.iloc[[0], :]

example_row

|   | x | control | treatment | control_tail | treatment_tail |
|---|---|---|---|---|---|
| 0 | -0.14 | 4.96 | 1.069 | 2 | 19 |


As we said, for plotting it would be more convenient to have the above example data row be in the following format.

In [4]:
# specify, after hard work, how we wish the example row
# was structured
d_want = pd.DataFrame({
    'x': -0.14,
    'group': ['treatment', 'control'],
    'y': [1.069, 4.96],
    'tail': [19, 2],
})

d_want

|   | x | group | y | tail |
|---|---|---|---|---|
| 0 | -0.14 | treatment | 1.069 | 19 |
| 1 | -0.14 | control | 4.96 | 2 |


In this format a single record spans multiple rows.

The job of the data scientist is to work out what formats data is available in, and to derive formats that make tasks easier. It took some thought to come up with the plotting-optimized format, and now we want to realize it.



## Solution

We have data reshaping or melding tools that can finish the task. What we do is convert `d_want` into a data record specification by:

  * Restricting to the "record data content portion" of the data. We treat `x` as a record key (not content) and exclude the `x` column.
  * Replacing the example values with value names.

The record specification can be built up as follows. This is using ideas from [the theory of coordinatized data](https://win-vector.com/tag/coordinatized-data/).

In [5]:
# convert the example result into a data specification
d_specification = pd.DataFrame({"group": d_want["group"]})
# build up frame with names instead of values found in d_want
for new_col, suffix in [('y', ''), ('tail', '_tail')]:
    d_specification[new_col] = [k + suffix for k in d_specification['group']]

d_specification

|   | group | y | tail |
|---|---|---|---|
| 0 | treatment | treatment | treatment_tail |
| 1 | control | control | control_tail |


Notice the above record content specification is essentially a copy of the content carrying portion of `d_want`.
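To make the role of this table of names concrete, here is a minimal pure-Python sketch of how such a specification can drive the reshaping of one row record. It is an illustration only, not the `data_algebra` implementation; the helper name `row_to_block` and the list-of-dicts representation are assumptions made for the example.

```python
# Hypothetical, dependency-free sketch: each value cell of the specification
# names the row-record column whose value belongs in that position.
specification = [
    {"group": "treatment", "y": "treatment", "tail": "treatment_tail"},
    {"group": "control", "y": "control", "tail": "control_tail"},
]

row_record = {
    "x": -0.14,
    "control": 4.96,
    "treatment": 1.069,
    "control_tail": 2,
    "treatment_tail": 19,
}


def row_to_block(row, spec, record_keys=("x",), control_keys=("group",)):
    """Expand one row record into a multi-row block record (illustrative only)."""
    block = []
    for spec_row in spec:
        # record id keys are copied unchanged onto every output row
        out = {k: row[k] for k in record_keys}
        for col, cell in spec_row.items():
            if col in control_keys:
                out[col] = cell  # structure keys are literal labels
            else:
                out[col] = row[cell]  # value cells name a source column
        block.append(out)
    return block


block = row_to_block(row_record, specification)
for r in block:
    print(r)
```

Under these assumptions the result has the same content as `d_want`: the specification says where each value goes, and the row record supplies the values.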

Now we take this symbolic data frame and turn it into a complete data record specification by using the `RecordSpecification` class. We specify:

  * What the data record values block looks like.
  * What key columns tell us which record we are working with (`record_keys = ['x']`).
  * What key columns tell us which row is which within a record (`control_table_keys = ['group']`).

In [6]:
# upgrade the data specification into a record specification
rs = RecordSpecification(
    d_specification,
    record_keys=['x'],
    control_table_keys=['group'],
    )

rs

|   | record structure | value | value |
|---|---|---|---|
|   | group | y | tail |
| 0 | treatment | treatment | treatment_tail |
| 1 | control | control | control_tail |


Notice the `RecordSpecification` class organizes all of the above concerns together.

A core piece of the `data_algebra` `cdata` data shaping interface is the `RecordMap` class. A `RecordMap` takes two primary arguments:

  * `blocks_in`: the specification of the incoming records (`None` used to specify single row records).
  * `blocks_out`: the specification of outgoing records (`None` used to specify single row records).

In [7]:
# show that RecordMap takes input and output RecordSpecification(s).
help(RecordMap.__init__)

Help on function __init__ in module data_algebra.cdata:

__init__(self, *, blocks_in: Optional[data_algebra.cdata.RecordSpecification] = None, blocks_out: Optional[data_algebra.cdata.RecordSpecification] = None, strict: bool = True)
    Build the transform specification. At least one of blocks_in or blocks_out must not be None.
    
    :param blocks_in: incoming record specification, None for row-records.
    :param blocks_out: outgoing record specification, None for row-records.
    :param strict: if True insist block be strict, and in and out blocks agree on row-form columns.



Or we can use a single `RecordSpecification` to generate a complete `RecordMap` by asking for a transform to/from row records (records where all values are in a single row) or to/from keyed column records (records where all values are in a single column). In our case the `.map_from_rows()` specifies the transform we want.

In [8]:
# ask the record specification to design a map from 
# rows to records of our specified form
map_from_rows = rs.map_from_rows()

map_from_rows

|   | record id | value | value | value | value |
|---|---|---|---|---|---|
|   | x | treatment | control | treatment_tail | control_tail |
| 0 | x record key | treatment value | control value | treatment_tail value | control_tail value |

|   | record id | record structure | value | value |
|---|---|---|---|---|
|   | x | group | y | tail |
| 0 | x record key | control | control value | control_tail value |
| 1 | x record key | treatment | treatment value | treatment_tail value |


Of course, we don't know this really works until we try it. Let's see the transform in action.


In [9]:
# apply the mapping to our original example data
d_records = map_from_rows(d_row_form)

In [10]:
# define formatted Pandas display
def display_formatted(
    d: pd.DataFrame,
    record_id_cols,
    control_id_cols,
):
    display(HTML(data_algebra.cdata._format_table(
        d,
        record_id_cols=record_id_cols,
        control_id_cols=control_id_cols,
        add_style=True,
    )))

In [11]:
# format the results
display_formatted(
    d_records,
    record_id_cols=rs.record_keys,
    control_id_cols=rs.control_table_keys)


|   | record id | record structure | value | value |
|---|---|---|---|---|
|   | x | group | y | tail |
| 0 | -0.2 | control | 5.21 | 3 |
| 1 | -0.2 | treatment | 1.16196 | 11 |
| 2 | -0.14 | control | 4.96 | 2 |
| 3 | -0.14 | treatment | 1.069 | 19 |


And we now have all of our data transformed into the format we specified.



## Playing a bit more with the system and data

From a tooling point of view, the `cdata` `RecordSpecification` supplies four major convenience functions.

  * `.map_from_rows()`
  * `.map_to_rows()`
  * `.map_from_keyed_column()`
  * `.map_to_keyed_column()`

`.map_from_rows()` and `.map_to_rows()` map a general record structure from and to rows, and are the core of the `cdata` data "pivoting" system. The idea is that data is in records, and sometimes those records span multiple rows. These are the fundamental operations.

`.map_from_keyed_column()` and `.map_to_keyed_column()` map between a general record structure and essentially [RDF Triples](https://en.wikipedia.org/wiki/Semantic_triple). This has a number of direct applications, and it directly supports concepts such as `melt()` and `cast()`.
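For readers who know the melt/cast vocabulary, the following sketch shows the same keyed column idea using only Pandas, not `cdata`. It assumes Pandas' `DataFrame.melt()` for the melt step and `DataFrame.pivot()` in the role of a cast; the names `keyed` and `wide` are illustrative.

```python
import pandas as pd

# the row-record example data from earlier in this note
d_row_form = pd.DataFrame({
    "x": [-0.14, -0.2],
    "control": [4.96, 5.21],
    "treatment": [1.069, 1.16196],
    "control_tail": [2, 3],
    "treatment_tail": [19, 11],
})

# "melt": row records -> keyed column form (one value-carrying column)
keyed = d_row_form.melt(id_vars=["x"], var_name="measure", value_name="value")

# "cast": keyed column form -> row records again
wide = keyed.pivot(index="x", columns="measure", values="value").reset_index()

print(keyed)
print(wide)
```

Each row of `keyed` is a triple: which record (`x`), which fact (`measure`), and the value itself.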

About 90% of data reshaping tasks are actually simple maps between "row records" (records where all data is in a single row) and "keyed columns" (or triples, where all but one column are keys). Our example above was a bit more general. For fully general transforms one instantiates a `RecordMap` directly, specifying the desired input and output record specifications.



### Inverting the block to row transform

For fun, we can invert the transform to pull data back in the other direction (note: row order and column order are considered inessential in this formulation).

In [12]:
# define the inverse map from block records to rows
inv_map = map_from_rows.inverse()
d_recovered = inv_map(d_records)
assert equivalent_frames(d_recovered, d_row_form)

In [13]:
# display our recovered row form
display_formatted(
    d_recovered,
    record_id_cols=rs.row_record_form().record_keys,
    control_id_cols=rs.row_record_form().control_table_keys)

|   | record id | value | value | value | value |
|---|---|---|---|---|---|
|   | x | control | control_tail | treatment | treatment_tail |
| 0 | -0.2 | 5.21 | 3 | 1.16196 | 11 |
| 1 | -0.14 | 4.96 | 2 | 1.069 | 19 |
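The inverse direction can also be sketched in plain Python, again as an illustration rather than the `data_algebra` implementation. The helper name `blocks_to_rows` is hypothetical: it collects all rows sharing a record key and uses the specification to decide which row-record column each value lands in.

```python
# Hypothetical sketch of the block-to-row direction, driven by the same
# table of names used to specify the block records.
specification = [
    {"group": "treatment", "y": "treatment", "tail": "treatment_tail"},
    {"group": "control", "y": "control", "tail": "control_tail"},
]

block_rows = [
    {"x": -0.2, "group": "control", "y": 5.21, "tail": 3},
    {"x": -0.2, "group": "treatment", "y": 1.16196, "tail": 11},
    {"x": -0.14, "group": "control", "y": 4.96, "tail": 2},
    {"x": -0.14, "group": "treatment", "y": 1.069, "tail": 19},
]


def blocks_to_rows(rows, spec, record_keys=("x",), control_keys=("group",)):
    """Collapse multi-row block records into one row per record key."""
    # look up the specification row by its structure-key values
    spec_index = {tuple(s[k] for k in control_keys): s for s in spec}
    out = {}
    for r in rows:
        rec_id = tuple(r[k] for k in record_keys)
        target = out.setdefault(rec_id, {k: r[k] for k in record_keys})
        spec_row = spec_index[tuple(r[k] for k in control_keys)]
        for col, cell in spec_row.items():
            if col not in control_keys:
                target[cell] = r[col]  # the cell names the destination column
    return list(out.values())


recovered = blocks_to_rows(block_rows, specification)
for r in recovered:
    print(r)
```

Because the result is keyed by record id and column name, neither the order of the input rows nor the order of the output columns affects its content.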


### Keyed column format

An example of the "keyed column" form of our example data is as follows.

In [14]:
# define map from block records to a keyed column format
kc_map = rs.map_to_keyed_column()

kc_map

|   | record id | record structure | value | value |
|---|---|---|---|---|
|   | x | group | y | tail |
| 0 | x record key | treatment | treatment value | treatment_tail value |
| 1 | x record key | control | control value | control_tail value |

|   | record id | record structure | value |
|---|---|---|---|
|   | x | measure | value |
| 0 | x record key | control | control value |
| 1 | x record key | control_tail | control_tail value |
| 2 | x record key | treatment | treatment value |
| 3 | x record key | treatment_tail | treatment_tail value |


In [15]:
# apply the map and display the results
triple_form = kc_map(d_records)

display_formatted(
    triple_form,
    record_id_cols=kc_map.blocks_out.record_keys,
    control_id_cols=kc_map.blocks_out.control_table_keys)

|   | record id | record structure | value |
|---|---|---|---|
|   | x | measure | value |
| 0 | -0.2 | control | 5.21 |
| 1 | -0.2 | control_tail | 3.0 |
| 2 | -0.2 | treatment | 1.16196 |
| 3 | -0.2 | treatment_tail | 11.0 |
| 4 | -0.14 | control | 4.96 |
| 5 | -0.14 | control_tail | 2.0 |
| 6 | -0.14 | treatment | 1.069 |
| 7 | -0.14 | treatment_tail | 19.0 |


The above keyed column or triple form is considered fundamental in some other data transform treatments. However, it depends on all values having the same type (or on there being no type enforcement). Our treatment instead considers row records to be the fundamental format.
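The type issue is visible in the display above, where the integer tail counts appear as `3.0` and `11.0`. The following small sketch reproduces the effect with plain Pandas (using `DataFrame.melt()` rather than `cdata`, as an assumption made to keep the example dependency-light): stacking an integer column and a floating point column into one value column forces a common type.

```python
import pandas as pd

d = pd.DataFrame({
    "control": [4.96, 5.21],   # floating point measurements
    "control_tail": [2, 3],    # integer counts
})
print(d.dtypes)

# stack both columns into a single value-carrying column
melted = d.melt(var_name="measure", value_name="value")
print(melted)
print(melted["value"].dtype)
```

In row-record form each column keeps its own type, which is one practical argument for treating row records as the fundamental format.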

## Some theory and terminology

Now that we have worked some examples together, we have some shared experience we can use to support the following commentary.

Our theory of record transformation is centered on the following potted history.

  * In statistics data is usually organized into rows (instances) and columns (measurements). In statistics this is often called a "[design matrix](https://en.wikipedia.org/wiki/Design_matrix)" or "data frame", and this has a very long history. This promotes a view that many records with one row are a preferred form for many tasks.
  * In computer science data was historically organized into general [records](https://en.wikipedia.org/wiki/Record_(computer_science)) or structured blocks that may have 1 or 2 dimensional or even nested structure. General records and structured record readers were a common extension feature in programming languages such as Fortran, COBOL, Basic, and Pascal.
  * Relational databases, spreadsheets, and [CSV file formats](https://en.wikipedia.org/wiki/Comma-separated_values) promote the view of records having a regular 1 dimensional structure.
  * An alternate view of records having many rows, and only one value carrying column is the core of [semantic triples](https://en.wikipedia.org/wiki/Semantic_triple) or the [entity–attribute–value model](https://en.wikipedia.org/wiki/Entity–attribute–value_model).
  * Converting between record formats is sometimes called "[pivoting](https://en.wikipedia.org/wiki/Pivot_table)", "folding", "stacking", "melting", or "casting" (though some of these operations combine aggregation).

Instead of locking on any of the above formulations, we endorse the fluid view of data from [Codd's "guaranteed access rule"](https://en.wikipedia.org/wiki/Codd%27s_12_rules#Rules):

> Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

(I.e. record structure or row structure is just a convenience framework around locating and naming values by instance id and structure id.)
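As a small illustration of this reading of the rule, the sketch below stores one of our example records in two different shapes and fetches the same datum from each. The nested-dictionary "database" and the helper name `get_datum` are hypothetical, chosen only to mirror the (table name, primary key value, column name) addressing.

```python
# Hypothetical in-memory "tables", each indexed by its primary key.
database = {
    "d_row_form": {
        (-0.14,): {
            "control": 4.96,
            "treatment": 1.069,
            "control_tail": 2,
            "treatment_tail": 19,
        },
    },
    "d_records": {
        (-0.14, "treatment"): {"y": 1.069, "tail": 19},
        (-0.14, "control"): {"y": 4.96, "tail": 2},
    },
}


def get_datum(db, table_name, primary_key, column_name):
    """Locate a single value by table name, primary key value, and column name."""
    return db[table_name][primary_key][column_name]


# the same fact, addressed through two different record structures
from_row = get_datum(database, "d_row_form", (-0.14,), "treatment_tail")
from_block = get_datum(database, "d_records", (-0.14, "treatment"), "tail")
print(from_row, from_block)
```

Reshaping changes the coordinates used to reach a value (here part of the column name moves into the key), not the value itself.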

Given this context, the `cdata` or [coordinatized data](https://win-vector.com/tag/coordinatized-data/) concepts include:

  * Records may have multiple rows and columns. Those with multiple rows are called "block records."
  * Records have two families of keys: those that say which individual we are referring to (record id keys) and those that say which fact in a record we are referring to (record structure keys).
  * Records that have only one row are called "row records", and are a fundamental representation.
  * Records that have only one value column are called "keyed columns" (all the other columns are row and record keys).
  * Fluid conversion between record structures is critical. Some conversions may exchange record id keys and record structure keys, creating new records.




## Conclusion

I strongly feel that specifying data transforms as example records to example records is the correct formulation for data reshaping or pivoting. I also feel that transforms between arbitrary records and rows (not keyed columns!) are fundamental. One of these transforms is in fact a join, and the other an aggregation, which gives these operations a foundation based on the usual [relational operators](https://en.wikipedia.org/wiki/Relational_algebra).

I think the above system has achieved "data transforms specified by data", which should turn out to be more flexible than transforms specified as code.

The above ideas have implementations in Python as the [`data_algebra` `cdata` methods](https://github.com/WinVector/data_algebra) and in `R` as [`cdata`](https://github.com/WinVector/cdata) (available on [PyPI](https://pypi.org/project/data-algebra/) and [CRAN](https://CRAN.R-project.org/package=cdata) respectively). The `data_algebra` (including `cdata` transforms) works over Pandas, Polars, or in remote databases (without moving data out of the database, as the transform itself can be translated into SQL).
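The remark that one direction is a join and the other an aggregation can be sketched in SQL using Python's built-in `sqlite3` module. This is hand-written SQL meant to show the idea, not the SQL that `data_algebra` generates. The table names and the simplified one-column `control_table` are assumptions; the name mapping that the full specification carries is written out here as `CASE` expressions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE d_row_form "
    "(x REAL, control REAL, treatment REAL, control_tail INTEGER, treatment_tail INTEGER)"
)
con.executemany(
    "INSERT INTO d_row_form VALUES (?, ?, ?, ?, ?)",
    [(-0.14, 4.96, 1.069, 2, 19), (-0.2, 5.21, 1.16196, 3, 11)],
)
# simplified control table: only the record structure key
con.execute('CREATE TABLE control_table ("group" TEXT)')
con.executemany("INSERT INTO control_table VALUES (?)", [("treatment",), ("control",)])

# rows -> blocks: a cross join against the control table
con.execute("""
    CREATE TABLE d_records AS
    SELECT r.x AS x,
           c."group" AS "group",
           CASE c."group" WHEN 'treatment' THEN r.treatment
                          WHEN 'control' THEN r.control END AS y,
           CASE c."group" WHEN 'treatment' THEN r.treatment_tail
                          WHEN 'control' THEN r.control_tail END AS tail
    FROM d_row_form AS r CROSS JOIN control_table AS c
""")
blocks = con.execute(
    'SELECT x, "group", y, tail FROM d_records ORDER BY x, "group"'
).fetchall()

# blocks -> rows: an aggregation grouped by the record key
rows = con.execute("""
    SELECT x,
           MAX(CASE WHEN "group" = 'control' THEN y END) AS control,
           MAX(CASE WHEN "group" = 'treatment' THEN y END) AS treatment,
           MAX(CASE WHEN "group" = 'control' THEN tail END) AS control_tail,
           MAX(CASE WHEN "group" = 'treatment' THEN tail END) AS treatment_tail
    FROM d_records
    GROUP BY x
    ORDER BY x
""").fetchall()
con.close()

print(blocks)
print(rows)
```

The join multiplies each row record by the rows of the control table, and the grouped aggregation collapses each block back into a single row.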