
Improved DPlyr-Like Python Interface #302

Closed
Mytherin opened this issue Oct 8, 2019 · 8 comments

Mytherin (Collaborator) commented Oct 8, 2019

For Python, we want to implement a dplyr-like interface so users do not need to use SQL to express their queries. Instead, they can manipulate the data using operations that will be converted into DuckDB physical operators and executed using the DuckDB query engine. Operators can be chained using a dplyr-like syntax (e.g. by overloading the '>>' operator).
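
As a rough illustration of the '>>' chaining idea (all names below are hypothetical, not an actual implementation):

# minimal sketch: '>>' chaining via operator overloading; nothing is executed,
# each '>>' just records another operator on top of the current plan
class Filter:
    def __init__(self, predicate):
        self.predicate = predicate

class Relation:
    def __init__(self, plan=()):
        self.plan = plan

    def __rshift__(self, op):
        # 'rel >> Filter("C1 > 100")' wraps the current plan in a new node
        return Relation(self.plan + (op,))

rel = Relation() >> Filter('C1 > 100') >> Filter('C1 < 120')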

Expressions are still modeled as strings, as this allows for easy combination of multiple expressions. They are parsed as SQL expressions, e.g. by prefixing them with 'SELECT ', running them through the parser and then extracting the parsed expressions again.
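
For instance (sketch only; parse_sql() below is a hypothetical stand-in for DuckDB's internal SQL parser):

# 'SUM(i), COUNT(*)' -> prefixed with 'SELECT ' and parsed as a dummy query;
# the parsed select list is exactly the list of expression trees we want
def parse_expressions(expr_string):
    statement = parse_sql('SELECT ' + expr_string)  # hypothetical parser hook
    return statement.select_list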

Example 1: read from CSV

import duckdb
# csv loading
read_csv = duckdb.from_csv('test.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
# perform filter and aggr
res = read_csv.filter('C1 > 100 AND C1 < 120').aggr("COUNT(*)", groups="C2")
# print or convert to Pandas DF triggers execution and materialization
print(res)
df = res.to_df()

Example 2: use persistent storage

# open database file
db = duckdb.open('database.db')
# read CSV again
read_csv = duckdb.from_csv('test.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
# write into table, this will trigger write into persistent storage
read_csv.store(db, 'test')
# read from table and filter again
db.table('test').filter('C1 > 100').print()

Example 3: update/delete

# open database file, reading table 'test' from disk again
db = duckdb.open('database.db')

# delete some rows, this will affect the persistent storage
db.table('test').filter('C1 < 0').delete()
# update rows, this will affect the persistent storage
db.table('test').filter('C1 = 999').update('C1 = NULL')
hannes (Member) commented Oct 8, 2019

I think this is a great idea. I am not sure about the >> syntax; I would much prefer the

res = read_csv.filter('C1 > 100 AND C1 < 120').aggr("COUNT(*)", groups="C2")

syntax. The reason dplyr has the %>% is that calling functions on S3 objects in R cannot use postfix notation.

hannes added the Python label Oct 8, 2019
xhochy (Contributor) commented Oct 8, 2019

The Python equivalent to dplyr is https://docs.ibis-project.org/notebooks/tutorial/2-Basics-Aggregate-Filter-Limit.html.

Mytherin (Collaborator, Author) commented Oct 9, 2019

Implementation Ideas

Every node returns a duckdb.Relation object. This object has a list of names and a list of types. The node has functions for printing it (print) and for applying additional operators. The node itself does not represent any materialized data initially; only on calls to e.g. print or to_df() will the relation be materialized by executing the query.
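
A minimal sketch of this lazy Relation idea (class and method names are illustrative only, not the actual implementation):

class Relation:
    def __init__(self, plan, names, types):
        self.plan = plan      # logical plan built so far, nothing executed yet
        self.names = names    # output column names
        self.types = types    # output column types

    def filter(self, predicate):
        # wraps the current plan in a filter node; the schema stays the same
        return Relation(('filter', predicate, self.plan), self.names, self.types)

    def to_df(self):
        # only here would the plan be handed to the engine and materialized
        raise NotImplementedError('execute self.plan and return a Pandas DataFrame')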

The following operators seem necessary (perhaps there are some I am forgetting):

Sources

  • from_csv(filename, separator, names, types, ...): Read a CSV file and produce a Relation
  • from_df(pandas_df): Read a Pandas DF and produce a Relation (see the usage sketch after this list)
  • db.table(table_name): Read a table from the database and produce a Relation
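
For example, the from_df source could be used like this (following the proposed API above; these method names are the proposal's, not an existing release):

import duckdb
import pandas as pd

df = pd.DataFrame({'C1': [1, 150, 110], 'C2': ['a', 'b', 'b']})
rel = duckdb.from_df(df)
# same chaining as with the CSV source
rel.filter('C1 > 100').aggregate('COUNT(*)', groups='C2').print()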

Manipulation

  • project(select_list): Returns a new Relation with the names/types of the given select_list.
  • filter(predicate): Applies the filter to the Relation; the names/types of the result do not change.
  • aggregate(aggregates, groups): Performs the (optionally grouped) aggregations. Returns a Relation that holds the groups and the aggregate results.
  • join(target, on, type=duckdb.InnerJoin): Performs a join with the target relation. on can either be a single column name given as a string (in case the names are the same in both relations) or a join condition (e.g. l_orderkey=o_orderkey). Returns a Relation with the combined columns of both input relations (see the join sketch after this list).
  • cross_product(target): Performs a cross product with the target relation, similar to join but without join conditions.
  • limit(n, offset=0): Performs a limit+offset operation; the names/types of the result do not change.
  • order(order_list): Performs an ORDER BY operation; the names/types of the result do not change.
  • distinct(): Performs a DISTINCT operation; the names/types of the result do not change.
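
For instance, a join under the proposed API might look like this (table and column names are made up for the example):

import duckdb

db = duckdb.open('database.db')
orders = db.table('orders')
lineitem = db.table('lineitem')
# join on a condition, then aggregate per customer
res = lineitem.join(orders, 'l_orderkey=o_orderkey').aggregate('SUM(l_extendedprice)', groups='o_custkey')
print(res)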

Modifying Data

All of the modify operations directly trigger execution and do not return a relation.

  • delete(): Delete the input tuples. Delete() can only be used if there is a single db.table() source.
  • update(update_list): Update the input tuples to the desired values. Update() can only be used if there is a single db.table() source.
  • store(db, table_name, temp=False, overwrite=False): Store the Relation as a table with the given name in the provided database, potentially overwriting the table if the option is set.
  • append(db, table_name): Append to an existing table in the given database. Throws an error if the table does not exist or the columns do not match (see the sketch after this list).
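
As an illustration of store/append under the proposed API (file and table names are placeholders):

import duckdb

db = duckdb.open('database.db')
part1 = duckdb.from_csv('part1.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
part1.store(db, 'test')       # creates the table 'test'
part2 = duckdb.from_csv('part2.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
part2.append(db, 'test')      # appends; errors if the schemas do not match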

Result Visualization/Conversion

  • head(): Fetch a single chunk from the result and print to screen in pretty format
  • print(): Materialize result and print to screen in pretty format
  • to_df(): Convert relation to a Pandas DataFrame

More Examples

import duckdb
# table: integers(i INTEGER)
# query: SELECT SUM(i)+2*COUNT(*) FROM integers WHERE i>10
db = duckdb.connect('test.db')
# produces a relation -> names: ['i'] types: [INTEGER]
tbl = db.table('integers')
# returns a relation with the same names/types  -> names: ['i'] types: [INTEGER]
tbl = tbl.filter('i > 10')
# aggr without groups, optional aliases can be provided
aggr = tbl.aggr(['SUM(i)', 'COUNT(*)'], names=["sum", "count"])
# projection now can only refer to the aggr columns
proj = aggr.project('sum * 2+count')
proj.print()

For lists, I would argue we should allow you to either provide a Python list, or to do the comma separation inside the string itself, i.e. the following are identical:

tbl.project(['i', 'i+2'])
tbl.project('i, i+2')

Actual Implementation

All of these operators map very naturally to our individual LogicalOperators, and by creating a logical plan we can run all our optimizers prior to execution. The composability where everything returns a Relation seems very clean and clear to me as well.

The main implementation difficulty here is, I think, parsing the expressions and binding them; however, this should not be very complicated either. The expressions can be parsed using our SQL parser, and column references can then be resolved by looking at the input relation and turned into BoundColumnReferences. Other operators/functions will be bound similarly.
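
A rough sketch of that binding step (all names here are hypothetical, not the actual internals):

class BoundColumnRef:
    def __init__(self, index, sql_type):
        self.index = index          # position in the input relation
        self.sql_type = sql_type    # resolved type

def bind(expr, relation):
    if expr.kind == 'COLUMN_REF':
        # resolve the column name against the input relation's schema
        idx = relation.names.index(expr.name)
        return BoundColumnRef(idx, relation.types[idx])
    # other expression kinds: bind children recursively, resolve functions, ...
    expr.children = [bind(child, relation) for child in expr.children]
    return expr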

Catalog lookups happen only in the db.table operator. Note here that we should keep a transaction open not just during execution, but also during construction: if db.table binds a table, another transaction should not be allowed to then delete that table, as that would produce a crash.

Functions are a concern as well, since they also live in the catalog of a database; however, we can simply resort to resolving only built-in functions here instead of performing a lookup in the DB catalog.

Updates/Deletions

Updates and deletions rely on row identifiers to figure out which rows to modify. As such, they must be bound to a base table. The following operations are legal:

# change entire table
db.table('integers').delete()
db.table('integers').update('i=NULL')
# change subset of table
db.table('integers').filter('i>10').delete()
# limit/order by can also be used
db.table('integers').filter('i>10').order('i DESC').limit(10).delete()

However, any operation that produces a new Relation (project, aggr, join, cross_product) cannot be used in conjunction with these operators.

Multiple Databases

This interface also allows easily combining tables from different databases, although some care must be taken w.r.t. transactions here, as operating on multiple databases at once also means having multiple open transactions. Modifications (update/delete/append) should always concern only a single database, however, and as such should not pose much of a problem.

xhochy (Contributor) commented Oct 9, 2019

@Mytherin That is a very close description of Ibis. There are certainly slight differences in the API, but Ibis is also a dplyr-like frontend to multiple databases that builds up an expression tree. You should definitely check it out before replicating it.

hannes (Member) commented Oct 9, 2019

Indeed, we should check whether we could extend Ibis to do what we envision.

Mytherin (Collaborator, Author) commented Oct 9, 2019

I would say we should not limit this interface to Python; we can create the interface in C++ and then create a Python layer on top of it. For the C++ interface, at least, I suggest we implement the proposed interface; for Python, we can consider building the Ibis interface on top of the proposed C++ interface.

Looking at Ibis, I can see that it has some similarities, but also a number of differences that I don't really like, e.g.:

  • table.group_by(keys).aggregate([metric]): Separating the grouping from the aggregate introduces a lot of problems from a user perspective, because group_by cannot really return a relation (at least in Pandas) but instead returns a special "grouping object". This is problematic from an interface design perspective and not so clean imo. Having the aggregate and grouping together makes this much cleaner, as the aggregate function simply returns a relation just like all the other operators.
  • table[table.bigint_col > 50]: This kind of design seems clean for simple cases but gets pretty ugly when you want to combine multiple conditions imo. A simple condition like i > 50 AND i < 100 now becomes table[(table.i > 50) & (table.i < 100)], with required brackets because of operator precedence, and you get very weird errors if you instead write the more logical table[table.i > 50 & table.i < 100] or table[table.i > 50 and table.i < 100]. Something like filter('i > 50 AND i < 100') seems cleaner and more user-friendly to me. We can use prepared statements to allow for variables, e.g. filter('i > $1 AND i < $2', 50, 100). This is also consistent with the SQL interface.
  • Having all these different functions to achieve SQL-like behavior also introduces a large, unnecessary learning curve for people who are already familiar with SQL, e.g. look at this snippet from the tutorial:
cond1 = table.bigint_col > 50
cond2 = table.int_col.between(2, 7)
table[cond1 | cond2].count()

I would argue this is cleaner and more readable:

table.filter('bigint_col > 50 OR int_col BETWEEN 2 AND 7').aggregate('COUNT(*)')
# or if we make a count(*) shortcut, which seems sensible:
table.filter('bigint_col > 50 OR int_col BETWEEN 2 AND 7').count()

I'm not entirely opposed to implementing the Ibis-like interface, as I know people in the Python community are used to it, but I think the interface proposed here is cleaner and more user-friendly, especially for people familiar with SQL. Having worked with Pandas (which has a very similar interface), I think it has a number of design flaws that lead to some very surprising behaviors. The proposed interface also maps very cleanly to the implementation of our database, which makes it easy to implement and maintain.

hannes (Member) commented Nov 4, 2019

Also interesting would be a programmatic way to construct CREATE TABLE statements.

hannes (Member) commented Apr 23, 2020

This is being worked on in the pyrel branch

hannes closed this as completed Jul 3, 2020