I'd like to demonstrate a neat new feature in the [data algebra](https://github.com/WinVector/data_algebra)'s [SQL](https://en.wikipedia.org/wiki/SQL) query generator: common expression elimination.

To begin, let's import our packages.

In [1]:
import pandas as pd

import data_algebra
from data_algebra.sql_format_options import SQLFormatOptions
from data_algebra.data_ops import descr
import data_algebra.BigQuery

We set up our example data.

In [2]:
d = pd.DataFrame({
    'x': [1, 2, 3],
})

d

   x
0  1
1  2
2  3


We define a simple operator pipeline in the data algebra. What we are demonstrating is joining a result against itself. In practice there is little reason to write such a trivial self-join, but the effect is much easier to see in an example this small.

This set of operations is actually a directed acyclic graph (DAG) rather than a tree, as the two table descriptions refer to the same data.


In [3]:
ops = (
    descr(d=d)
        .extend({'y': 'x + 1'})
        .natural_join(
            b=(
                descr(d=d)
                    .extend({'y': 'x + 1'})
                    .extend({'z': '-y'})
            ),
            by=['x'],
            jointype='left',
        )
)

ops

(
    TableDescription(table_name="d", column_names=["x"])
    .extend({"y": "x + 1"})
    .natural_join(
        b=TableDescription(table_name="d", column_names=["x"])
        .extend({"y": "x + 1"})
        .extend({"z": "-(y)"}),
        by=["x"],
        jointype="LEFT",
    )
)

The effect of the transform on a [Pandas data frame](https://pandas.pydata.org) can be seen as follows.

In [4]:
ops.transform(d)

   x  y  z
0  1  2 -2
1  2  3 -3
2  3  4 -4


The purpose of the data algebra is to have a convenient query language that translates well into Pandas operations and also into SQL.

So let's translate this operator pipeline into SQL, with the `use_cte_elim` option set to `True`. This option directs the data algebra SQL translator to eliminate duplicate common table expressions. That lets us represent our data processing DAG as a DAG in SQL, and avoids expression explosion and the generation of redundant sub-expressions.

In [5]:
db_model = data_algebra.BigQuery.BigQueryModel()
sql = db_model.to_sql(
    ops,
    sql_format_options=SQLFormatOptions(
        use_with=True,
        annotate=True,
        use_cte_elim=True,
    )
)

print(sql)

-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel 1.3.4
--       string quote: "
--   identifier quote: `
WITH
 `extend_1` AS (
  SELECT  -- .extend({ 'y': 'x + 1'})
   `x` ,
   `x` + 1 AS `y`
  FROM
   `d`
 ) ,
 `extend_3` AS (
  SELECT  -- .extend({ 'z': '-(y)'})
   `x` ,
   `y` ,
   -(`y`) AS `z`
  FROM
   `extend_1`
 )
SELECT  -- _0..natural_join(b= _1, by=['x'], jointype='LEFT')
 COALESCE(`join_source_left_0`.`x`, `join_source_right_0`.`x`) AS `x` ,
 COALESCE(`join_source_left_0`.`y`, `join_source_right_0`.`y`) AS `y` ,
 `z`
FROM
(
 `extend_1` `join_source_left_0`
LEFT JOIN
 `extend_3` `join_source_right_0`
ON (
 `join_source_left_0`.`x` = `join_source_right_0`.`x`
)
)
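
As a rough sanity check, a query of this shape can be run with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: the SQL is hand-simplified rather than generated (the backtick quoting and long join aliases are dropped, and the short aliases `l` and `r` and the `ORDER BY` are additions made for this example), and SQLite stands in for BigQuery. It confirms only that the two-step `WITH` structure reproduces the Pandas result.

```python
import sqlite3

# In-memory database holding the example table d.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x INTEGER)")
conn.executemany("INSERT INTO d (x) VALUES (?)", [(1,), (2,), (3,)])

# Hand-simplified version of the generated query (an assumption, not
# data algebra output): extend_1 is defined once and used twice.
query = """
WITH
 extend_1 AS (
  SELECT x, x + 1 AS y FROM d
 ),
 extend_3 AS (
  SELECT x, y, -(y) AS z FROM extend_1
 )
SELECT
 COALESCE(l.x, r.x) AS x,
 COALESCE(l.y, r.y) AS y,
 z
FROM extend_1 l
LEFT JOIN extend_3 r
ON l.x = r.x
ORDER BY 1
"""

rows = conn.execute(query).fetchall()
print(rows)
```

The rows should match the `x`, `y`, `z` values produced by `ops.transform(d)`.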



In the generated SQL, notice that not only is the table `d` referenced only once, but the initial calculation "`.extend({"y": "x + 1"})`" on it is also performed only once: `extend_3` reads from `extend_1` instead of recomputing it. This is what we mean by common expression elimination.
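
The idea can be sketched in a few lines of plain Python. This is a toy illustration, not the data algebra's actual implementation: the nested-tuple encoding of operators, the `eliminate_common` helper, and the generated step names are all assumptions made up for this example.

```python
# Toy operator nodes as nested tuples: (op_name, argument, *input_nodes).
table_d = ("table", "d")
y_left = ("extend", "y = x + 1", table_d)
# Built separately, but structurally equal to y_left.
y_right = ("extend", "y = x + 1", ("table", "d"))
z_step = ("extend", "z = -y", y_right)
join = ("natural_join", "by x", y_left, z_step)


def eliminate_common(expr):
    """Return (steps, root_name), with one named step per distinct sub-expression."""
    names = {}  # sub-expression -> name already assigned to it
    steps = []  # (name, op, argument, input_names), in dependency order

    def visit(node):
        if node in names:
            # Structurally equal node seen before: reuse its name.
            return names[node]
        op, arg, *inputs = node
        input_names = [visit(child) for child in inputs]
        name = f"{op}_{len(names)}"
        names[node] = name
        steps.append((name, op, arg, input_names))
        return name

    root_name = visit(expr)
    return steps, root_name


steps, root_name = eliminate_common(join)
for step in steps:
    print(step)
```

The two separately built `extend` nodes compare equal, so they collapse into a single named step. The result has four steps, where the fully expanded tree has six nodes.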

Data algebra queries are machine-generated, and target only a subset of SQL. However, in our opinion, the query quality is getting to be quite good.
