## Introduction

The [data algebra](https://github.com/WinVector/data_algebra) is a system for designing data transformations that can be used in Pandas or SQL. The 1.3.0 version introduces a lot of early checking and warnings to make designing data transforms more convenient and safer.

## An Example

I'd like to demonstrate some of the features with an example.

Let's import our packages and some simple example data.

In [1]:
import pandas as pd
from data_algebra.data_ops import descr
import data_algebra.test_util
import data_algebra.BigQuery

In [2]:
d = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

d

|   | id | x | g |
|---|----|---|---|
| 0 | 0 | 4.0 | a |
| 1 | 1 | 50.0 | b |
| 2 | 2 | 1.0 | a |
| 3 | 3 | 3.0 | a |
| 4 | 4 | 2.2 | b |


The data algebra is "Python first", in that we choose method names close to what Pandas and NumPy users expect. Methods that create new columns are collected into a single transformation step, in this case an "extend" node. Node documentation can be found [here](https://github.com/WinVector/data_algebra).

Our example task is computing the median of the "x" column for each group of rows identified by the "g" column. The ability to work over an arbitrary number of values and disjoint groups (all in one column) is a hallmark of vectorized or relational calculation. This allows us to work efficiently at large data scales.
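For readers who think in Pandas terms, the task can be sketched directly with `groupby(...).transform(...)`. This is only an illustration of the intended calculation, not how the data algebra implements it; the names `d_demo` and `x_median` are chosen here for the sketch.

```python
import pandas as pd

# Same shape of data as the example frame: an id, a value, and a group label.
d_demo = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

# transform() returns one value per input row, so each row receives
# the median of its own group and the row count is unchanged.
group_median = d_demo.groupby('g')['x'].transform('median')
windowed = d_demo.assign(x_median=group_median)
print(windowed)
```

Every input row is kept, and rows in the same group share the same median value.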

### The Solution

With some experience we can write the data algebra solution as follows.

In [3]:

ops = (
    descr(d=d)
        .extend(
            {'xm': 'x.median()'},
            partition_by=['g'])
        .order_rows(['id'])
)


The "extend()" and "order_rows()" steps are operators, which have an introduction [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Introduction/data_algebra_Introduction.ipynb). The methods we can use inside these nodes mostly follow Pandas and NumPy, and are listed in a table [here](https://github.com/WinVector/data_algebra/blob/main/Examples/Methods/op_catalog.csv).

Now let's apply our specified transform to our example data. The new column "xm" has the correct group medians assigned to each original row.

In [4]:
pandas_res = ops.transform(d)

pandas_res

|   | id | x | g | xm |
|---|----|---|---|----|
| 0 | 0 | 4.0 | a | 3.0 |
| 1 | 1 | 50.0 | b | 26.1 |
| 2 | 2 | 1.0 | a | 3.0 |
| 3 | 3 | 3.0 | a | 3.0 |
| 4 | 4 | 2.2 | b | 26.1 |


## In Database

Part of the power of the data algebra is that the transform can be translated into SQL for execution on different databases. For example, we could try to execute this query on Google BigQuery as follows.

We build a database connection and insert our example data. In real applications the data would likely be large, and already in the database.

In [5]:
bigquery_handle = data_algebra.BigQuery.example_handle()
bigquery_handle.insert_table(d, table_name='d', allow_overwrite=True)

(TableDescription(table_name="d", column_names=["id", "x", "g"]))

Let's translate the pipeline into SQL so we can run it in the database. In a large scale application we would avoid moving data to or from Python by landing the result directly in the database using a `CREATE TABLE` statement.

In [6]:
bigquery_sql = bigquery_handle.to_sql(ops)


And let's see that work in BigQuery.

In [7]:
# works
db_res = bigquery_handle.read_query(bigquery_sql)

db_res

|   | g | x | id | xm |
|---|---|---|----|----|
| 0 | a | 4.0 | 0 | 3.0 |
| 1 | b | 50.0 | 1 | 26.1 |
| 2 | a | 1.0 | 2 | 3.0 |
| 3 | a | 3.0 | 3 | 3.0 |
| 4 | b | 2.2 | 4 | 26.1 |


In [8]:
assert data_algebra.test_util.equivalent_frames(pandas_res, db_res)
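
The idea of landing a result directly in the database with `CREATE TABLE` can be sketched using SQLite from the Python standard library as a stand-in for BigQuery. The table name `d_summary` and the grouped-mean query are assumptions made for this illustration; in practice the query text would be whatever SQL the translation step produced for the target database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE d (id INTEGER, x REAL, g TEXT)')
conn.executemany(
    'INSERT INTO d VALUES (?, ?, ?)',
    [(0, 4, 'a'), (1, 50, 'b'), (2, 1, 'a'), (3, 3, 'a'), (4, 2.2, 'b')],
)

# Any SELECT could go here; a grouped mean keeps the sketch portable.
query = 'SELECT g, AVG(x) AS x_mean FROM d GROUP BY g'

# The result is written to a new table inside the database,
# so the result rows never pass through Python.
conn.execute(f'CREATE TABLE d_summary AS {query}')

landed_rows = conn.execute('SELECT COUNT(*) FROM d_summary').fetchone()[0]
print(landed_rows)
conn.close()
```

Only the row count is read back, which is the kind of small confirmation one might want after a large in-database write.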

## A Variation

Now if we wanted only one row per group during our median calculation we would use the following pipeline, replacing the "extend" with a "project" (trying to stay close to Codd's relational terminology).

In [9]:
ops_p = (
    descr(d=d)
        .project(
            {'xm': 'x.median()'},
            group_by=['g'],
        )
)

This pipeline works as follows.

In [10]:
pandas_res_p = ops_p.transform(d)

pandas_res_p

|   | g | xm |
|---|---|----|
| 0 | a | 3.0 |
| 1 | b | 26.1 |
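
In plain Pandas, a one-row-per-group result like this corresponds to a grouped aggregation rather than a `transform`. The following is a minimal sketch of that equivalent calculation, assuming a stand-in frame named `d_demo`; it is not the code the data algebra runs internally.

```python
import pandas as pd

d_demo = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

# A grouped aggregation collapses each group to a single row,
# keeping only the grouping column and the aggregated value.
projected = (
    d_demo.groupby('g', as_index=False)
    .agg(xm=('x', 'median'))
)
print(projected)
```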


But we get a warning if we attempt to convert this to BigQuery SQL.

In [11]:
# warns!
sql_p = bigquery_handle.to_sql(ops_p)



It turns out we can't ignore the warning. Attempting to execute the SQL fails.

In [12]:
# indeed, fails
# Notes: https://stackoverflow.com/a/57718190/6901725
try:
    bigquery_handle.read_query(sql_p)
except Exception as ex:
    print(f'caught: {ex}')

caught: 400 percentile_cont aggregate function is not supported.

(job ID: d65e53c4-7338-4123-b4b1-e7ee11e4f626)

                 -----Query Job SQL Follows-----                  

    |    .    |    .    |    .    |    .    |    .    |    .    |
   1:-- data_algebra SQL https://github.com/WinVector/data_algebra
   2:--  dialect: BigQueryModel
   3:--       string quote: "
   4:--   identifier quote: `
   5:WITH
   6: `table_reference_0` AS (
   7:  SELECT
   8:   `x` ,
   9:   `g`
  10:  FROM
  11:   `data-algebra-test.test_1.d`
  12: )
  13:SELECT  -- .project({ 'xm': 'x.median()'}, group_by=['g'])
  14: PERCENTILE_CONT(`x`, 0.5) AS `xm` ,
  15: `g`
  16:FROM
  17: `table_reference_0`
  18:GROUP BY
  19: `g`
    |    .    |    .    |    .    |    .    |    .    |    .    |


One familiar with Google BigQuery will recognize the issue. The "PERCENTILE_CONT" function is only available in windowed contexts (where the number of rows returned equals the number of input rows), and not in project contexts (where one row is returned per group).
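
For reference, a continuous percentile at fraction 0.5 is a median computed by linear interpolation between the middle values of the sorted group. Below is a minimal plain-Python sketch of that calculation. The helper name `percentile_cont` is ours, it does not model NULL handling, and it is not BigQuery's implementation.

```python
import math


def percentile_cont(values, fraction):
    """Linearly interpolated percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Fractional index into the sorted values.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    # Interpolate between the two neighbouring sorted values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# The "x" values of the example data, split by the "g" column.
groups = {'a': [4, 1, 3], 'b': [50, 2.2]}
medians = {g: percentile_cont(xs, 0.5) for g, xs in groups.items()}
print(medians)
```

The calculation itself is the same in a windowed or a grouped setting; the restriction described above is about where the database allows the function to appear.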

The failing SQL is this:

In [13]:
print(sql_p)


-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `table_reference_0` AS (
  SELECT
   `x` ,
   `g`
  FROM
   `data-algebra-test.test_1.d`
 )
SELECT  -- .project({ 'xm': 'x.median()'}, group_by=['g'])
 PERCENTILE_CONT(`x`, 0.5) AS `xm` ,
 `g`
FROM
 `table_reference_0`
GROUP BY
 `g`



And the working SQL is this:

In [14]:
print(bigquery_sql)


-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `extend_0` AS (
  SELECT  -- .extend({ 'xm': 'x.median()'}, partition_by=['g'])
   `g` ,
   `x` ,
   `id` ,
   PERCENTILE_CONT(`x`, 0.5) OVER ( PARTITION BY `g`  )  AS `xm`
  FROM
   `data-algebra-test.test_1.d`
 )
SELECT  -- .order_rows(['id'])
 *
FROM
 `extend_0`
ORDER BY
 `id`



The above failure can come as a surprise. But the new feature of the data algebra is that the "translate to SQL" step warned us we had a potential problem. This doesn't even require a full database handle; the information is built into the database model when the package is assembled.
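
Because the notice arrives at translation time, it can in principle be checked in an automated test before any query is sent to a database. The sketch below shows the general Python pattern using the standard `warnings` module. Here `translate_to_sql` is a hypothetical stand-in, not a data algebra function, and we are assuming the real translation step reports its notice through Python's `warnings` machinery.

```python
import warnings


def translate_to_sql(uses_unsupported_aggregate):
    """Hypothetical stand-in for a SQL translation step that may warn."""
    if uses_unsupported_aggregate:
        warnings.warn("aggregate not supported in this context", UserWarning)
    return "SELECT 1"


def translation_warnings(uses_unsupported_aggregate):
    """Run the translation and return any warning messages it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        translate_to_sql(uses_unsupported_aggregate)
    return [str(w.message) for w in caught]


clean = translation_warnings(False)
flagged = translation_warnings(True)
print(clean, flagged)
```

A test suite could fail the build whenever the returned list is non-empty for a pipeline intended for a given database.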

## Patching The Solution

We can work around the BigQuery limitation by simulating the project-median with an extend-median step followed by a project-mean step. However, we feel automating such a conversion would hide too many details from the user.

In [15]:
ops_p_2 = (
    ops  # start with our extend median solution
        .project(
            {'xm': 'xm.mean()'},  # pseudo-aggregation, xm constant per group
            group_by=['g'],
        )
)

db_res_p = bigquery_handle.read_query(ops_p_2)

db_res_p

|   | xm | g |
|---|----|---|
| 0 | 3.0 | a |
| 1 | 26.1 | b |


In [16]:
assert data_algebra.test_util.equivalent_frames(pandas_res_p, db_res_p)
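
The reasoning behind the workaround can be checked in plain Pandas: a windowed median is constant within each group, so taking its mean per group returns that same value while collapsing the group to one row. This is an illustrative sketch with a stand-in frame `d_demo`, not the SQL the data algebra generates.

```python
import pandas as pd

d_demo = pd.DataFrame({
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

# Step 1: windowed median, constant within each group.
with_median = d_demo.assign(
    xm=d_demo.groupby('g')['x'].transform('median')
)

# Step 2: averaging a per-group constant returns that constant,
# so the mean only serves to collapse each group to one row.
two_step = with_median.groupby('g', as_index=False).agg(xm=('xm', 'mean'))

# Direct grouped median, for comparison.
direct = d_demo.groupby('g', as_index=False).agg(xm=('x', 'median'))
print(two_step)
```

Other aggregations that return a constant unchanged, such as a minimum or maximum, would serve the same collapsing role.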

## Conclusion

And that is the newest feature of the 1.3.0 data algebra: per-database SQL translation warnings. I feel the data algebra has about as much breadth or footprint of correct translations as other SQL generators. However, it now is a bit more forthright in saying if your project is in that correct region. This is a help in building complex statistical queries (such as our [t-test example](https://github.com/WinVector/data_algebra/blob/main/Examples/GettingStarted/solving_problems_using_data_algebra.ipynb) or our [xicor example](https://github.com/WinVector/data_algebra/blob/main/Examples/xicor/xicor_frame.ipynb)).


## Appendix

We built up ops_p_2 by adding a step to ops. The data algebra has minor optimizers both in the pipeline and SQL steps. For example, we can see in the combined pipeline the intermediate `order_rows()` node is eliminated.

In [17]:
ops_p_2

(
    TableDescription(table_name="d", column_names=["id", "x", "g"])
    .extend({"xm": "x.median()"}, partition_by=["g"])
    .project({"xm": "xm.mean()"}, group_by=["g"])
)

### Clean Up

In [18]:
# clean up
bigquery_handle.drop_table('d')
bigquery_handle.close()