A quick demonstration of how we specify operator pipelines.

This example is taken from the [README](https://github.com/WinVector/data_algebra/blob/main/README.ipynb)
of the [data_algebra](https://github.com/WinVector/data_algebra) package.

We import our packages.

In [1]:

import pandas as pd

from data_algebra import descr, d_, one
import data_algebra.data_ops  # provides describe_table(), used below
import data_algebra.BigQuery


We set up some example data.

In [2]:
d = pd.DataFrame({
    'subjectID':[1, 1, 2, 2],
    'surveyCategory': [ "withdrawal behavior", "positive re-framing", "withdrawal behavior", "positive re-framing"],
    'assessmentTotal': [5., 2., 3., 4.],
    'irrelevantCol1': ['irrel1']*4,
    'irrelevantCol2': ['irrel2']*4,
})

d

|   | subjectID | surveyCategory | assessmentTotal | irrelevantCol1 | irrelevantCol2 |
|---|---|---|---|---|---|
| 0 | 1 | withdrawal behavior | 5.0 | irrel1 | irrel2 |
| 1 | 1 | positive re-framing | 2.0 | irrel1 | irrel2 |
| 2 | 2 | withdrawal behavior | 3.0 | irrel1 | irrel2 |
| 3 | 2 | positive re-framing | 4.0 | irrel1 | irrel2 |


We first specify the operations in our usual manner: as quoted string expressions, which data_algebra then parses.

In [3]:
scale = 0.237

ops = data_algebra.data_ops.describe_table(d, 'd'). \
    extend({'probability': f'(assessmentTotal * {scale}).exp()'}). \
    extend({'total': 'probability.sum()'},
           partition_by=['subjectID']). \
    extend({'probability': 'probability/total'}). \
    extend({'row_number': '(1).cumsum()'},
           partition_by=['subjectID'],
           order_by=['probability'], reverse=['probability']). \
    select_rows('row_number == 1'). \
    select_columns(['subjectID', 'surveyCategory', 'probability']). \
    rename_columns({'diagnosis': 'surveyCategory'})

This produces an operator pipeline that can be used on Pandas data frames.

In [4]:
ops.transform(d)

|   | subjectID | diagnosis | probability |
|---|---|---|---|
| 0 | 1 | withdrawal behavior | 0.670622 |
| 1 | 2 | positive re-framing | 0.558974 |
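
To make the arithmetic behind these two probabilities concrete, here is a minimal standard-library sketch that repeats the pipeline's steps by hand: exponentiate the scaled score, normalize within each subject, and keep the highest-probability row. It is only an illustration of the computation and does not use data_algebra; the names `rows`, `weights`, `totals`, and `best` are hypothetical.

```python
import math
from collections import defaultdict

# Same example data as above, minus the irrelevant columns.
rows = [
    (1, "withdrawal behavior", 5.0),
    (1, "positive re-framing", 2.0),
    (2, "withdrawal behavior", 3.0),
    (2, "positive re-framing", 4.0),
]
scale = 0.237

# Step 1: unnormalized weight exp(assessmentTotal * scale) for every row.
weights = [(subj, cat, math.exp(total * scale)) for subj, cat, total in rows]

# Step 2: sum of the weights within each subjectID.
totals = defaultdict(float)
for subj, _, w in weights:
    totals[subj] += w

# Steps 3 and 4: normalize, then keep the most probable category per subject.
best = {}
for subj, cat, w in weights:
    p = w / totals[subj]
    if subj not in best or p > best[subj][1]:
        best[subj] = (cat, p)

for subj in sorted(best):
    cat, p = best[subj]
    print(subj, cat, round(p, 6))
```

The printed values should agree with the `probability` column above, up to display rounding.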


The operators can also be translated into SQL for use in large data stores.

In [5]:
handle = data_algebra.BigQuery.BigQueryModel().db_handle(conn=None)

sql = handle.to_sql(ops)

print(sql)

-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel 1.6.5
--       string quote: "
--   identifier quote: `
WITH
 `table_reference_0` AS (
  SELECT
   `subjectID` ,
   `surveyCategory` ,
   `assessmentTotal`
  FROM
   `d`
 ) ,
 `extend_1` AS (
  SELECT  -- .extend({ 'probability': '(assessmentTotal * 0.237).exp()'})
   `subjectID` ,
   `surveyCategory` ,
   EXP(`assessmentTotal` * 0.237) AS `probability`
  FROM
   `table_reference_0`
 ) ,
 `extend_2` AS (
  SELECT  -- .extend({ 'total': 'probability.sum()'}, partition_by=['subjectID'])
   `subjectID` ,
   `surveyCategory` ,
   `probability` ,
   SUM(`probability`) OVER ( PARTITION BY `subjectID`  )  AS `total`
  FROM
   `extend_1`
 ) ,
 `extend_3` AS (
  SELECT  -- .extend({ 'probability': 'probability / total'})
   `subjectID` ,
   `surveyCategory` ,
   `probability` / `total` AS `probability`
  FROM
   `extend_2`
 ) ,
 `extend_4` AS (
  SELECT  -- .extend({ 'row_number': '(1).cumsum()'}, partition
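
For readers without a BigQuery connection, the same chain of common table expressions can be tried locally. The sketch below is hand-written, not output from `handle.to_sql()`, and it rests on several assumptions: it targets an in-memory SQLite database (3.25 or newer, for window functions) rather than BigQuery, it registers `EXP` as a Python function because SQLite builds do not always include one, and it uses `ROW_NUMBER()` for the ranking step, which is our choice and not taken from the generated SQL shown above.

```python
import math
import sqlite3

rows = [
    (1, "withdrawal behavior", 5.0),
    (1, "positive re-framing", 2.0),
    (2, "withdrawal behavior", 3.0),
    (2, "positive re-framing", 4.0),
]

conn = sqlite3.connect(":memory:")
# Assumption: supply EXP ourselves, since not every SQLite build has it.
conn.create_function("EXP", 1, math.exp)
conn.execute(
    "CREATE TABLE d (subjectID INTEGER, surveyCategory TEXT, assessmentTotal REAL)"
)
conn.executemany("INSERT INTO d VALUES (?, ?, ?)", rows)

# Hand-written query mirroring the pipeline steps, one CTE per step.
query = """
WITH
 extend_1 AS (
  SELECT subjectID, surveyCategory,
         EXP(assessmentTotal * 0.237) AS probability
  FROM d
 ),
 extend_2 AS (
  SELECT subjectID, surveyCategory, probability,
         SUM(probability) OVER (PARTITION BY subjectID) AS total
  FROM extend_1
 ),
 extend_3 AS (
  SELECT subjectID, surveyCategory,
         probability / total AS probability
  FROM extend_2
 ),
 extend_4 AS (
  SELECT subjectID, surveyCategory, probability,
         ROW_NUMBER() OVER (
           PARTITION BY subjectID ORDER BY probability DESC
         ) AS row_number
  FROM extend_3
 )
SELECT subjectID, surveyCategory AS diagnosis, probability
FROM extend_4
WHERE row_number = 1
ORDER BY subjectID
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each row of `result` should correspond to a row returned by `ops.transform(d)` on the pandas side.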

We can also build up operation pipelines without using quoted expressions.
We do this with the `d_.c` "column" notation to access column definitions, and the `val()` "value" notation to
inject values. The value notation is only needed when a value is not combined with a column; otherwise
we can use plain Python values directly. The example below uses the imported constant `one` for the value 1.

The above pipeline can be specified in that manner as follows.

In [6]:
scale = 0.237

ops2 = (
    descr(d=d)
        .extend({"probability": (d_.assessmentTotal * scale).exp()})
        .extend({"total": d_.probability.sum()}, partition_by=["subjectID"])
        .extend({"probability": d_.probability / d_.total})
        .extend(
            {"row_number": one.cumsum()},
            partition_by=["subjectID"],
            order_by=["probability"],
            reverse=["probability"],
        )
        .select_rows(d_.row_number == 1)
        .select_columns(
            ["subjectID", "surveyCategory", "probability"])
        .rename_columns({"diagnosis": "surveyCategory"})
)

In [7]:
ops2

(
    TableDescription(
        table_name="d",
        column_names=[
            "subjectID",
            "surveyCategory",
            "assessmentTotal",
            "irrelevantCol1",
            "irrelevantCol2",
        ],
    )
    .extend({"probability": "(assessmentTotal * 0.237).exp()"})
    .extend({"total": "probability.sum()"}, partition_by=["subjectID"])
    .extend({"probability": "probability / total"})
    .extend(
        {"row_number": "(1).cumsum()"},
        partition_by=["subjectID"],
        order_by=["probability"],
        reverse=["probability"],
    )
    .select_rows("row_number == 1")
    .select_columns(["subjectID", "surveyCategory", "probability"])
    .rename_columns({"diagnosis": "surveyCategory"})
)


In [8]:
ops2.transform(d)


|   | subjectID | diagnosis | probability |
|---|---|---|---|
| 0 | 1 | withdrawal behavior | 0.670622 |
| 1 | 2 | positive re-framing | 0.558974 |


The idea is that in this mode, expressions are written as ordinary Python expressions, so column names need no quotes.
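
One way such quote-free notation can work in Python is through attribute access and operator overloading: each operation returns an object that records the expression instead of evaluating it. The sketch below illustrates that general technique only. It is not data_algebra's implementation, and `Expr`, `Columns`, `cols`, and `one_value` are hypothetical names. It also ignores operator precedence when nesting binary operators, which a real implementation would have to handle.

```python
class Expr:
    """Hypothetical expression node that records its own source text."""

    def __init__(self, text, atomic=False):
        self.text = text      # textual form of the expression
        self.atomic = atomic  # True if a method call needs no parentheses

    def _binary(self, op, other):
        # Plain Python values are embedded through their repr().
        rhs = other.text if isinstance(other, Expr) else repr(other)
        return Expr(f"{self.text} {op} {rhs}")

    def _method(self, name):
        base = self.text if self.atomic else f"({self.text})"
        return Expr(f"{base}.{name}()", atomic=True)

    def __mul__(self, other):
        return self._binary("*", other)

    def __truediv__(self, other):
        return self._binary("/", other)

    def __eq__(self, other):
        # Returns an expression, not a bool: the comparison is deferred.
        return self._binary("==", other)

    def exp(self):
        return self._method("exp")

    def sum(self):
        return self._method("sum")

    def cumsum(self):
        return self._method("cumsum")

    def __repr__(self):
        return self.text


class Columns:
    """Hypothetical namespace: any attribute name becomes a column reference."""

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return Expr(name, atomic=True)


cols = Columns()
one_value = Expr("1")  # stand-in for a constant value
scale = 0.237

built = [
    (cols.assessmentTotal * scale).exp(),  # (assessmentTotal * 0.237).exp()
    cols.probability.sum(),                # probability.sum()
    cols.probability / cols.total,         # probability / total
    one_value.cumsum(),                    # (1).cumsum()
    cols.row_number == 1,                  # row_number == 1
]
for e in built:
    print(e)
```

The printed strings have the same form as the quoted expressions in the `ops2` printout above, which suggests why the two ways of writing a pipeline can describe the same operations.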
