A quick demo of data_algebra applying one operator pipeline at three different scales of data: pandas.DataFrame, dask.dataframe.DataFrame, and SQL.


First set up our (trivial) example.

In [1]:
import pandas
import dask.dataframe
import psycopg2

from data_algebra.data_ops import *
import data_algebra.PostgreSQL


d_pandas = pandas.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8]
})
d_pandas

,x,y
0,1,5
1,2,6
2,3,7
3,4,8


Define our (trivial) operator pipeline.

In [2]:
ops = TableDescription('d', ['x', 'y']) .\
    extend({'z': 'x + y'})
ops

TableDescription(table_name='d', column_names=['x', 'y']) .\
   extend({'z': 'x + y'})

Apply the operators to the pandas.DataFrame.

In [3]:
ops.transform(d_pandas)

,x,y,z
0,1,5,6
1,2,6,8
2,3,7,10
3,4,8,12
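
For comparison, the same derived column can be written directly in pandas. This is only a sketch of the equivalent computation, not a description of how data_algebra implements extend(); the names d and expected are illustrative.

```python
import pandas

d = pandas.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8]
})

# Plain-pandas counterpart of extend({'z': 'x + y'}):
# assign() returns a new frame and leaves d unchanged.
expected = d.assign(z=d['x'] + d['y'])
print(expected)
```

The difference is that this expression is tied to pandas, while the ops pipeline is a description that can be applied to other back ends.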


Set up a dask example.

In [4]:
d_dask = dask.dataframe.from_pandas(d_pandas, npartitions=2)

Apply the same operators to the dask data structure.


In [5]:
r_dask = ops.transform(d_dask)
r_dask

,x,y,z
npartitions=2,,,
,int64,int64,int64
,...,...,...
,...,...,...


Call .compute() to get the result back.

In [6]:
r_dask.compute()

,x,y,z
0,1,5,6
1,2,6,8
0,3,7,10
1,4,8,12


Now the same thing in SQL with PostgreSQL.

First set up our database, and simulate having the data already in the database by copying the data over.

In [7]:
conn_p = psycopg2.connect(
    database="johnmount",
    user="johnmount",
    host="localhost",
    password=""
)
conn_p.autocommit = True

db_model_p = data_algebra.PostgreSQL.PostgreSQLModel()

db_model_p.insert_table(conn_p, d_pandas, 'd')

sql = ops.to_sql(db_model_p, pretty=True)
print(sql)

SELECT "y",
       "x",
       "x" + "y" AS "z"
FROM
  (SELECT "y",
          "x"
   FROM "d") "sq_0"


And execute the SQL.

In [8]:
db_model_p.read_query(conn_p, sql)

,y,x,z
0,5.0,1.0,6.0
1,6.0,2.0,8.0
2,7.0,3.0,10.0
3,8.0,4.0,12.0
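
The generated query is ordinary SQL, so it can be tried without a PostgreSQL server. The sketch below runs the same query text against an in-memory SQLite database from the Python standard library. SQLite is a stand-in chosen here for convenience, not something the demo itself uses, and the table is created by hand rather than through insert_table.

```python
import sqlite3

# The query text printed by ops.to_sql() above, reused verbatim.
sql = '''
SELECT "y",
       "x",
       "x" + "y" AS "z"
FROM
  (SELECT "y",
          "x"
   FROM "d") "sq_0"
'''

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
try:
    conn.execute('CREATE TABLE "d" ("x" INTEGER, "y" INTEGER)')
    conn.executemany(
        'INSERT INTO "d" ("x", "y") VALUES (?, ?)',
        [(1, 5), (2, 6), (3, 7), (4, 8)],
    )
    rows = conn.execute(sql).fetchall()
finally:
    conn.close()

# Each row is a (y, x, z) tuple; the query has no ORDER BY,
# so sort before printing to get a stable order.
for row in sorted(rows):
    print(row)
```

A query this simple uses only standard quoting and arithmetic; SQL generated for more complex pipelines may depend on PostgreSQL-specific syntax and not carry over unchanged.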


Clean up.

In [9]:
conn_p.close()