This is a small demonstration of the [`data_algebra`](https://github.com/WinVector/data_algebra) package.

First let's import our packages and set up an example data frame.

In [1]:

import pandas
from data_algebra.data_ops import *


d = pandas.DataFrame({
    'c': ['c', 'c', 'b', 'a'],
    'v': [1, 2, 3, 4],
})

d

Unnamed: 0,c,v
0,c,1
1,c,2
2,b,3
3,a,4


Now let's define our data transform using the data algebra. New columns are defined by a Python dictionary whose keys are the new column names and whose values are the source code for the expressions that compute them. We try to use Codd's names for operators: adding columns is `extend()`, and summarizing data is `project()`.

In [2]:
table_name = 'data-algebra-test.test_1.d'

operations = describe_table(d, table_name=table_name) .\
    extend({
        'g': '"prefix_" %+% c'  # concatenate strings
         }) .\
    project({  # build per-group (g) totals of v
        'group_total': 'v.sum()'
        },
        group_by=['g']
        ) .\
    order_rows(['g'])  # choose a presentation order of rows


We can then apply these operations to any data frame that has the columns specified in the table description (and appropriate column types).

In [3]:
res_pandas = operations.transform(d)

res_pandas

Unnamed: 0,g,group_total
0,prefix_a,4
1,prefix_b,3
2,prefix_c,3
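
For comparison, here is a minimal sketch of the same three steps written directly in pandas, without the data algebra. It is only meant to show what `extend()`, `project()`, and `order_rows()` each compute on this example; it is not a description of how `data_algebra` implements them internally. The names `step_extend`, `step_project`, and `res_by_hand` are illustrative.

```python
import pandas

d = pandas.DataFrame({
    'c': ['c', 'c', 'b', 'a'],
    'v': [1, 2, 3, 4],
})

# extend: add the derived column g by string concatenation
step_extend = d.assign(g='prefix_' + d['c'])

# project: one row per value of g, holding the sum of v for that group
step_project = (
    step_extend
    .groupby('g', as_index=False)
    .agg(group_total=('v', 'sum'))
)

# order_rows: sort by g and renumber the rows
res_by_hand = step_project.sort_values('g').reset_index(drop=True)

print(res_by_hand)
```

The data frame `res_by_hand` has the same columns and rows as the result above. The difference is that this version is tied to pandas, while the `operations` object can also be translated to SQL.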


Applying the same operations in a database is quite simple. First we connect to our database. Here we insert the data ourselves as an example; in serious applications the source table would usually already be present.

In [4]:
import os
from google.cloud import bigquery
import data_algebra.BigQuery

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/johnmount/big_query/big_query_jm.json"
biqquery_handle = data_algebra.BigQuery.BigQueryModel().db_handle(bigquery.Client())
biqquery_handle.insert_table(d, table_name=table_name, allow_overwrite=True)

Then we can generate SQL tailored to the specific database.

In [5]:
bigquery_sql = biqquery_handle.to_sql(operations, pretty=True)

print(bigquery_sql)

SELECT `g`,
       `group_total`
FROM
  (SELECT `g`,
          SUM(`v`) AS `group_total`
   FROM
     (SELECT ("prefix_"|| `c`) AS `g`,
             `v`
      FROM `data-algebra-test.test_1.d`) `extend_1`
   GROUP BY `g`) `project_2`
ORDER BY `g`


Operations can be used to land results in the database (using `CREATE TABLE AS`, avoiding round-tripping data in and out of the database). Operations can also be used to return results as a Pandas data frame.

In [6]:
res_bigquery = biqquery_handle.read_query(operations)

res_bigquery

Unnamed: 0,g,group_total
0,prefix_a,4
1,prefix_b,3
2,prefix_c,3


And we can check that we get equivalent results in Pandas and from the database.

In [7]:
assert res_pandas.equals(res_bigquery)
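
One point worth knowing about this check: `DataFrame.equals` also requires the column dtypes to match, not just the values. The assertion above passed in this run, so the dtypes agreed here. If a back end ever returned the same numbers with a different numeric type, a value-level comparison such as `pandas.testing.assert_frame_equal` with `check_dtype=False` could be used instead. The sketch below uses two hand-built frames (`left` and `right` are illustrative stand-ins, not results from a database) to show the difference.

```python
import pandas
from pandas.testing import assert_frame_equal

left = pandas.DataFrame({
    'g': ['prefix_a', 'prefix_b', 'prefix_c'],
    'group_total': [4, 3, 3],
})

# Hypothetical stand-in for a back end that returns the totals as floats.
right = left.astype({'group_total': 'float64'})

# equals() is strict about dtypes, so it reports a mismatch here.
strict_match = left.equals(right)
print(strict_match)

# assert_frame_equal can be told to compare values and ignore dtypes;
# it raises AssertionError if the values differ.
assert_frame_equal(left, right, check_dtype=False)
```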

In [8]:
biqquery_handle.close()

We can repeat the database example using another database, simply by building a different database handle.

In [9]:
import sqlite3
import data_algebra.SQLite

sqlite_handle = data_algebra.SQLite.SQLiteModel().db_handle(sqlite3.connect(":memory:"))

sqlite_sql = sqlite_handle.to_sql(operations, pretty=True)

print(sqlite_sql)

SELECT "g",
       "group_total"
FROM
  (SELECT "g",
          SUM("v") AS "group_total"
   FROM
     (SELECT ('prefix_'|| "c") AS "g",
             "v"
      FROM "data-algebra-test.test_1.d") "extend_1"
   GROUP BY "g") "project_2"
ORDER BY "g"
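
The generated text is ordinary SQL, so it can also be run without the data algebra. As a sketch, the cell below pastes the query printed above into Python's standard `sqlite3` module. The `CREATE TABLE` statement and its column types (`TEXT`, `INTEGER`) are assumptions made for this illustration; they are not necessarily what `insert_table()` issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Double quotes make the dotted, hyphenated name a single SQLite identifier.
conn.execute('CREATE TABLE "data-algebra-test.test_1.d" ("c" TEXT, "v" INTEGER)')
conn.executemany(
    'INSERT INTO "data-algebra-test.test_1.d" ("c", "v") VALUES (?, ?)',
    [("c", 1), ("c", 2), ("b", 3), ("a", 4)],
)

# The query text is copied from the generated SQLite SQL shown above.
sql = """
SELECT "g",
       "group_total"
FROM
  (SELECT "g",
          SUM("v") AS "group_total"
   FROM
     (SELECT ('prefix_'|| "c") AS "g",
             "v"
      FROM "data-algebra-test.test_1.d") "extend_1"
   GROUP BY "g") "project_2"
ORDER BY "g"
"""

rows = conn.execute(sql).fetchall()
conn.close()

print(rows)
```

Each nested `SELECT` corresponds to one step of the pipeline: the innermost is the `extend()`, the middle one is the `project()`, and the outer `ORDER BY` is the `order_rows()`.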


In [10]:
sqlite_handle.insert_table(d, table_name=table_name, allow_overwrite=True)
res_sqlite = sqlite_handle.read_query(operations)

res_sqlite

Unnamed: 0,g,group_total
0,prefix_a,4
1,prefix_b,3
2,prefix_c,3


In [11]:
assert res_sqlite.equals(res_bigquery)

Operations also have a readable printed form, which shows the table description followed by each step of the pipeline.

In [12]:
operations

TableDescription(
 table_name='data-algebra-test.test_1.d',
 column_names=[
   'c', 'v']) .\
   extend({
    'g': "'prefix_'.concat(c)"}) .\
   project({
    'group_total': 'v.sum()'},
   group_by=['g']) .\
   order_rows(['g'])

And that is a small demonstration of the data algebra.