The [data algebra](https://github.com/WinVector/data_algebra) is a system for designing data transformations that can be used in Pandas or SQL. The 1.3.0 version introduces a lot of early checking and warnings to make designing data transforms more convenient and safer.

I'd like to demonstrate some of these features with an example.

Let's import our packages and some simple example data.

In [1]:
import this

import pandas as pd
from data_algebra.data_ops import descr
import data_algebra.test_util
import data_algebra.BigQuery

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [2]:
d = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

d

Unnamed: 0,id,x,g
0,0,4.0,a
1,1,50.0,b
2,2,1.0,a
3,3,3.0,a
4,4,2.2,b


The data algebra is "Python first", in that we choose method names close to what Pandas and Numpy users expect.  For example, to sequentially number rows we might use a method called "cumcount".  Methods that create new columns are arranged together in a transformation step, in this case an "extend" node.  Node documentation can be found [here](https://github.com/WinVector/data_algebra).

Our example task is sequentially numbering the rows of our data frame. We use the variable "x" to determine order, and build an independent sequence for every set of rows with a given group value "g". The ability to work over an arbitrary number of values and disjoint groups (all in one column) is a hallmark of vectorized or relational calculation. This allows us to work efficiently at large data scales.

In [3]:
ops = (
    descr(d=d)
        .extend(
            {'o': '(1).cumcount()'},
            partition_by=['g'],
            order_by=['x'])
        .order_rows(['id'])
)

Now let's apply our specified transform to our example data. The new column "o" is the ordering of the rows, by "x" per "g" group.

In [4]:
pandas_res = ops.transform(d)

pandas_res

Unnamed: 0,id,x,g,o
0,0,4.0,a,2
1,1,50.0,b,1
2,2,1.0,a,0
3,3,3.0,a,1
4,4,2.2,b,0


This result seems a bit odd: the "counts" start at zero. We can confirm this is the expected Pandas behavior by running the cumcount directly in Pandas.

In [5]:
expect = d.copy()
expect['o'] = d.sort_values(['x']).groupby('g').cumcount().sort_index()
# .sort_index() is not needed to place the column back into the expect data
# frame, as row indices force the alignment at that point. However, if we were
# to look at the column by itself we may prefer the .sort_index() version.

expect

Unnamed: 0,id,x,g,o
0,0,4.0,a,2
1,1,50.0,b,1
2,2,1.0,a,0
3,3,3.0,a,1
4,4,2.2,b,0


We confirm this matches the data algebra result.



In [6]:
assert data_algebra.test_util.equivalent_frames(
    pandas_res,
    expect)

At this point I should say that I do not recommend cumcount. I used it because it comes with Pandas. I would prefer a cumsum on a column of all ones.  But let's continue with the cumcount for just a bit longer, before abandoning it.
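To make the comparison concrete, here is a minimal sketch in plain Pandas (not the data algebra) that computes both numberings on the example data. The helper column `one` is introduced only for this illustration.

```python
import pandas as pd

d = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

# Order rows by x so the running calculations follow that order.
ordered = d.sort_values(['x'])

# cumcount numbers the rows of each group starting at 0.
zero_based = ordered.groupby('g').cumcount().sort_index()

# A cumulative sum over a column of ones numbers the rows starting at 1.
one_based = ordered.assign(one=1).groupby('g')['one'].cumsum().sort_index()

print(pd.DataFrame({'cumcount': zero_based, 'cumsum_of_ones': one_based}))
```

The two columns differ by exactly one in every row, which is the whole difference between the two conventions.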

Part of the power of the data algebra is that the transform can be translated into SQL for execution on different databases.  For example, we could execute this query on Google BigQuery as follows.

We build a database connection and insert our example data. In real applications the data would likely be large, and already in the database.

In [7]:
bigquery_handle = data_algebra.BigQuery.example_handle()
bigquery_handle.insert_table(d, table_name='d', allow_overwrite=True)

(TableDescription(table_name="d", column_names=["id", "x", "g"]))

We, *in principle*, could now run the translated query. In a large scale application we would avoid the motion of data to or from Python by landing the result directly in the database using a `CREATE TABLE` statement.

Let's try to translate this into SQL.

In [8]:
bigquery_sql = bigquery_handle.to_sql(ops)




Notice that producing the SQL issued a warning. This is the data algebra system warning that the chosen method may not have a reliable translation into SQL at this time. It is a strong suggestion to try another form. In this case our experience suggests using a cumulative sum instead of a cumulative count.
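In an automated setting, it can help to record such warnings or turn them into errors rather than rely on someone reading the console. The following is a minimal sketch using Python's standard `warnings` module. It assumes the translation warning is issued through that module, which is not confirmed here. The function `hypothetical_to_sql` and its `UserWarning` category are stand-ins invented for this illustration, not part of the data algebra API.

```python
import warnings


def hypothetical_to_sql(expression: str) -> str:
    """Stand-in translator (hypothetical, NOT the data algebra API)."""
    if 'cumcount' in expression:
        warnings.warn(
            f'possibly unreliable translation: {expression}', UserWarning)
    return f'SELECT /* {expression} */ 1'


# Option 1: record warnings so they can be inspected or logged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    sql = hypothetical_to_sql('(1).cumcount()')
recorded = [str(w.message) for w in caught]

# Option 2: escalate warnings to exceptions so a build or test fails early.
escalated = False
with warnings.catch_warnings():
    warnings.simplefilter('error')
    try:
        hypothetical_to_sql('(1).cumcount()')
    except UserWarning:
        escalated = True

print(recorded, escalated)
```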

Let's try cumsum instead.

In [9]:
ops_2 = (
    descr(d=d)
        .extend(
            {'o': '(1).cumsum()'},
            partition_by=['g'],
            order_by=['x'])
        .order_rows(['id'])
)

When we run this query we get a reasonable, but different result. Notice the cumulative sum is inclusive (starts at 1), whereas the cumulative count is not. This is part of our advice to prefer cumulative sums to this count (which is not a count).

Let's try the operation in Pandas.

In [10]:
pandas_res_2 = ops_2.transform(d)

pandas_res_2

Unnamed: 0,id,x,g,o
0,0,4.0,a,3
1,1,50.0,b,2
2,2,1.0,a,1
3,3,3.0,a,2
4,4,2.2,b,1


In [11]:
expect_2 = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
    'o': [3, 2, 1, 2, 1],
})

assert data_algebra.test_util.equivalent_frames(
    pandas_res_2,
    expect_2)

Let's try this in the database.

In [12]:
bigquery_sql_2 = bigquery_handle.to_sql(ops_2)

print(bigquery_sql_2)

-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `extend_0` AS (
  SELECT  -- .extend({ 'o': '(1).cumsum()'}, partition_by=['g'], order_by=['x'])
   `id` ,
   `g` ,
   `x` ,
   SUM(1) OVER ( PARTITION BY `g` ORDER BY `x`  )  AS `o`
  FROM
   `data-algebra-test.test_1.d`
 )
SELECT  -- .order_rows(['id'])
 *
FROM
 `extend_0`
ORDER BY
 `id`



No warnings in the translation, so let's try that in action.

In [13]:
bigquery_res_2 = bigquery_handle.read_query(bigquery_sql_2)

bigquery_res_2

Unnamed: 0,id,g,x,o
0,0,a,4.0,3
1,1,b,50.0,2
2,2,a,1.0,1
3,3,a,3.0,2
4,4,b,2.2,1


We can confirm the results match.

In [14]:
assert data_algebra.test_util.equivalent_frames(
    pandas_res_2,
    bigquery_res_2)
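The window expression in the generated SQL can also be tried without a BigQuery account. Below is a minimal sketch using Python's built-in `sqlite3` module. It assumes the linked SQLite is version 3.25 or newer, which is when window functions were added. The in-memory database and the hand-written query are illustrative stand-ins; the query mirrors the shape of the generated SQL, not the data algebra's exact output.

```python
import sqlite3

rows = [
    (0, 4.0, 'a'),
    (1, 50.0, 'b'),
    (2, 1.0, 'a'),
    (3, 3.0, 'a'),
    (4, 2.2, 'b'),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE d (id INTEGER, x REAL, g TEXT)')
conn.executemany('INSERT INTO d (id, x, g) VALUES (?, ?, ?)', rows)

# Running sum of 1 within each g group, ordered by x.
query = '''
SELECT id, g, x,
       SUM(1) OVER (PARTITION BY g ORDER BY x) AS o
FROM d
ORDER BY id
'''
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

One caveat worth knowing: with the default window frame, rows that tie on "x" within a group would share the same running sum. The example data has no such ties.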

Let's get back to the topic of warnings. In this case the warning meant that we do not currently have a translation of the given method into the target SQL that we fully trust. Some of these we will fix. Some of these we will likely attempt to move users away from. The issue is that some fixes are a bit ugly, and if we hide that from the user then we have a pathological ["leaky abstraction"](https://en.wikipedia.org/wiki/Leaky_abstraction) that is just hiding nastiness. Many useful abstractions are leaky, but it becomes pathological when the majority of the infrastructure is attempting to hide unavoidable differences. By missing true feedback, the user is delayed in making correct choices.

Let's illustrate this with another example.

Suppose we want to compute the median value of the column "x" for each group of rows identified by column "g". A pipeline to do this is given as follows.

In [15]:
ops_m_w = (
    descr(d=d)
        .extend(
            {'xm': 'x.median()'},
            partition_by=['g'])
        .order_rows(['id'])
)

And this pipeline works in Pandas. It is easy to confirm that the correct median values land in the appropriate rows of "xm".

In [16]:
ops_m_w.transform(d)

Unnamed: 0,id,x,g,xm
0,0,4.0,a,3.0
1,1,50.0,b,26.1
2,2,1.0,a,3.0
3,3,3.0,a,3.0
4,4,2.2,b,26.1


And let's see that work in BigQuery.

In [17]:
# should not warn
sql_m_w = bigquery_handle.to_sql(ops_m_w)


In [18]:
# works
bigquery_handle.read_query(sql_m_w)

Unnamed: 0,id,g,x,xm
0,0,a,4.0,3.0
1,1,b,50.0,26.1
2,2,a,1.0,3.0
3,3,a,3.0,3.0
4,4,b,2.2,26.1


Now if we wanted only one row per group during our median calculation we would use the following pipeline, replacing the "extend" with a "project" (trying to stay close to Codd's relational terminology).

In [19]:
ops_m_p = (
    descr(d=d)
        .project(
            {'xm': 'x.median()'},
            group_by=['g'],
        )
)

This pipeline works as follows.

In [20]:
ops_m_p.transform(d)

Unnamed: 0,g,xm
0,a,3.0
1,b,26.1


But we get a warning if we attempt to convert this to BigQuery SQL.

In [21]:
# warns!
sql_m_p = bigquery_handle.to_sql(ops_m_p)



It turns out, we can't ignore the warning. Executing the SQL fails.

In [22]:
# indeed, fails
# Notes: https://stackoverflow.com/a/57718190/6901725
try:
    bigquery_handle.read_query(sql_m_p)
except Exception as ex:
    print(f'caught: {ex}')

caught: 400 percentile_cont aggregate function is not supported.

(job ID: 87d36f60-0623-4c52-9f49-777862a131ee)

                 -----Query Job SQL Follows-----                  

    |    .    |    .    |    .    |    .    |    .    |    .    |
   1:-- data_algebra SQL https://github.com/WinVector/data_algebra
   2:--  dialect: BigQueryModel
   3:--       string quote: "
   4:--   identifier quote: `
   5:WITH
   6: `table_reference_0` AS (
   7:  SELECT
   8:   `g` ,
   9:   `x`
  10:  FROM
  11:   `data-algebra-test.test_1.d`
  12: )
  13:SELECT  -- .project({ 'xm': 'x.median()'}, group_by=['g'])
  14: PERCENTILE_CONT(`x`, 0.5) AS `xm` ,
  15: `g`
  16:FROM
  17: `table_reference_0`
  18:GROUP BY
  19: `g`
    |    .    |    .    |    .    |    .    |    .    |    .    |


Those familiar with Google BigQuery will recognize the issue. The "PERCENTILE_CONT" function is only available in windowed contexts (where the number of rows returned equals the number of input rows), and not in project contexts (where one row is returned per group).
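The difference between the two contexts can be shown in plain Pandas. This is a minimal sketch, not the data algebra's translation: the median is first computed in a windowed fashion (one value per input row), and the result is then reduced to one row per group by keeping distinct pairs. Whether this two-step shape is the right workaround for a given database is an assumption to check against that database's documentation.

```python
import pandas as pd

d = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

# Windowed context: one median per input row, repeated within each group.
windowed = d.assign(xm=d.groupby('g')['x'].transform('median'))

# Reduce to one row per group by keeping the distinct (g, xm) pairs.
per_group = (
    windowed[['g', 'xm']]
    .drop_duplicates()
    .sort_values('g')
    .reset_index(drop=True)
)

print(per_group)
```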

The failing SQL is this:

In [23]:
print(sql_m_p)


-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `table_reference_0` AS (
  SELECT
   `g` ,
   `x`
  FROM
   `data-algebra-test.test_1.d`
 )
SELECT  -- .project({ 'xm': 'x.median()'}, group_by=['g'])
 PERCENTILE_CONT(`x`, 0.5) AS `xm` ,
 `g`
FROM
 `table_reference_0`
GROUP BY
 `g`



And the working SQL is this:

In [24]:
print(sql_m_w)


-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `extend_0` AS (
  SELECT  -- .extend({ 'xm': 'x.median()'}, partition_by=['g'])
   `id` ,
   `g` ,
   `x` ,
   PERCENTILE_CONT(`x`, 0.5) OVER ( PARTITION BY `g`  )  AS `xm`
  FROM
   `data-algebra-test.test_1.d`
 )
SELECT  -- .order_rows(['id'])
 *
FROM
 `extend_0`
ORDER BY
 `id`



This can come as a surprise. But the new feature of the data algebra is that the translation to SQL step warned us we had a potential problem. This doesn't even require a full database handle; the warning comes from data incorporated into the database model during package assembly.
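To illustrate how a check of this kind can run without any database connection, here is a deliberately simplified sketch. The class `HypotheticalDialect`, its fields, and the flagged (method, context) pairs are all invented for this example. They show the general idea of shipping known-problem information with a dialect description, and are not the data algebra's actual implementation.

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class HypotheticalDialect:
    """Illustrative dialect description (hypothetical, NOT a data algebra class)."""
    name: str
    # (method name, context) pairs with no trusted translation.
    untrusted: set = field(default_factory=set)

    def check(self, method: str, context: str) -> bool:
        """Warn and return False if the pair is flagged; needs no connection."""
        if (method, context) in self.untrusted:
            warnings.warn(
                f'{self.name}: {method} in {context} context '
                'may not translate reliably')
            return False
        return True


dialect = HypotheticalDialect(
    name='ExampleSQL',
    untrusted={('median', 'project'), ('cumcount', 'extend')},
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    window_ok = dialect.check('median', 'extend')    # not flagged
    project_ok = dialect.check('median', 'project')  # flagged, warns
messages = [str(w.message) for w in caught]

print(window_ok, project_ok, messages)
```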

And that is the newest feature of the 1.3.0 data algebra: per-database SQL translation warnings. I feel the data algebra has about as much breadth or footprint of correct translations as other SQL generators. However, it is now a bit more forthright in saying whether your project is in that correct region. This is a help in building complex statistical queries (such as our [t-test example](https://github.com/WinVector/data_algebra/blob/main/Examples/GettingStarted/solving_problems_using_data_algebra.ipynb) or our [xicor example](https://github.com/WinVector/data_algebra/blob/main/Examples/xicor/xicor_frame.ipynb)).


In [25]:
# clean up
bigquery_handle.drop_table('d')
bigquery_handle.close()