
Improved DPlyr-Like Python Interface #302

Closed
Mytherin opened this issue Oct 8, 2019 · 8 comments

Mytherin (Collaborator) commented Oct 8, 2019

For Python, we want to implement a dplyr-like interface so users do not need to use SQL to express their queries. Instead, they can manipulate the data using operations that will be converted into DuckDB physical operators and executed using the DuckDB query engine. Operators can be chained using a dplyr-like syntax (e.g. by overloading the '>>' operator).
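
As a rough illustration of the '>>' chaining idea (all names below are hypothetical, not an actual implementation):

# minimal sketch: '>>' chaining via operator overloading; nothing is executed,
# each '>>' just records another operator on top of the current plan
class Filter:
    def __init__(self, predicate):
        self.predicate = predicate

class Relation:
    def __init__(self, plan=()):
        self.plan = plan

    def __rshift__(self, op):
        # 'rel >> Filter("C1 > 100")' wraps the current plan in a new node
        return Relation(self.plan + (op,))

rel = Relation() >> Filter('C1 > 100') >> Filter('C1 < 120')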

Expressions are still modeled as strings, as this allows for easy combination of multiple expressions. They are parsed as SQL expressions, e.g. by prefixing them with 'SELECT ', running them through the parser and then extracting the parsed expressions again.
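
For instance (sketch only; parse_sql() below is a hypothetical stand-in for DuckDB's internal SQL parser):

# 'SUM(i), COUNT(*)' -> prefixed with 'SELECT ' and parsed as a dummy query;
# the parsed select list is exactly the list of expression trees we want
def parse_expressions(expr_string):
    statement = parse_sql('SELECT ' + expr_string)  # hypothetical parser hook
    return statement.select_list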

Example 1: read from CSV

import duckdb
# csv loading
read_csv = duckdb.from_csv('test.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
# perform filter and aggr
res = read_csv.filter('C1 > 100 AND C1 < 120').aggr("COUNT(*)", groups="C2")
# print or convert to Pandas DF triggers execution and materialization
print(res)
df = res.to_df()

Example 2: use persistent storage

# open database file
db = duckdb.open('database.db')
# read CSV again
read_csv = duckdb.from_csv('test.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
# write into table, this will trigger write into persistent storage
read_csv.store(db, 'test')
# read from table and filter again
db.table('test').filter('C1 > 100').print()

Example 3: update/delete

# open database file, reading table 'test' from disk again
db = duckdb.open('database.db')

# delete some rows, this will affect the persistent storage
db.table('test').filter('C1 < 0').delete()
# update rows, this will affect the persistent storage
db.table('test').filter('C1 = 999').update('C1 = NULL')
hannes (Member) commented Oct 8, 2019

I think this is a great idea. I am not sure about the >> syntax; I would much prefer the

res = read_csv.filter('C1 > 100 AND C1 < 120').aggr("COUNT(*)", groups="C2")

syntax. The reason dplyr has the %>% is that calling functions on S3 objects in R cannot use postfix notation.

hannes added the Python label Oct 8, 2019
xhochy (Contributor) commented Oct 8, 2019

The Python equivalent to dplyr is https://docs.ibis-project.org/notebooks/tutorial/2-Basics-Aggregate-Filter-Limit.html.

Mytherin (Collaborator, Author) commented Oct 9, 2019

Implementation Ideas

Every node returns a duckdb.Relation object. This object has a list of names and a list of types. The node has functions for printing it (print) and for applying additional operators. The node itself does not represent any materialized data initially; only on calls to e.g. print or to_df() will the relation be materialized by executing the query.
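
A minimal sketch of this lazy Relation idea (class and method names are illustrative only, not the actual implementation):

class Relation:
    def __init__(self, plan, names, types):
        self.plan = plan      # logical plan built so far, nothing executed yet
        self.names = names    # output column names
        self.types = types    # output column types

    def filter(self, predicate):
        # wraps the current plan in a filter node; the schema stays the same
        return Relation(('filter', predicate, self.plan), self.names, self.types)

    def to_df(self):
        # only here would the plan be handed to the engine and materialized
        raise NotImplementedError('execute self.plan and return a Pandas DataFrame')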

The following operators seem necessary (perhaps there are some I am forgetting):

Sources

  • from_csv(filename, separator, names, types, ...): Read a CSV file and produce a Relation
  • from_df(pandas_df): Read a Pandas DF and produce a Relation (see the usage sketch after this list)
  • db.table(table_name): Read a table from the database and produce a Relation
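
For example, the from_df source could be used like this (following the proposed API above; these method names are the proposal's, not an existing release):

import duckdb
import pandas as pd

df = pd.DataFrame({'C1': [1, 150, 110], 'C2': ['a', 'b', 'b']})
rel = duckdb.from_df(df)
# same chaining as with the CSV source
rel.filter('C1 > 100').aggregate('COUNT(*)', groups='C2').print()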

Manipulation

  • project(select_list): Returns a new Relation with the names/types of the given select_list.
  • filter(predicate): Applies the filter to the Relation; the names/types of the result do not change.
  • aggregate(aggregates, groups): Performs the (optionally grouped) aggregations. Returns a Relation that holds the groups and the aggregate results.
  • join(target, on, type=duckdb.InnerJoin): Performs a join with the target relation. on can either be a single column name given as a string (in case the names are the same in both relations) or a join condition (e.g. l_orderkey=o_orderkey). Returns a Relation with the combined columns of both input relations (see the join sketch after this list).
  • cross_product(target): Performs a cross product with the target relation, similar to join but without join conditions.
  • limit(n, offset=0): Performs a limit+offset operation; the names/types of the result do not change.
  • order(order_list): Performs an ORDER BY operation; the names/types of the result do not change.
  • distinct(): Performs a DISTINCT operation; the names/types of the result do not change.
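
For instance, a join under the proposed API might look like this (table and column names are made up for the example):

import duckdb

db = duckdb.open('database.db')
orders = db.table('orders')
lineitem = db.table('lineitem')
# join on a condition, then aggregate per customer
res = lineitem.join(orders, 'l_orderkey=o_orderkey').aggregate('SUM(l_extendedprice)', groups='o_custkey')
print(res)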

Modifying Data

All of the modify operations directly trigger execution and do not return a relation.

  • delete(): Delete the input tuples. Delete() can only be used if there is a single db.table() source.
  • update(update_list): Update the input tuples to the desired values. Update() can only be used if there is a single db.table() source.
  • store(db, table_name, temp=False, overwrite=False): Store the Relation as a table with the given name in the provided database, potentially overwriting the table if the option is set.
  • append(db, table_name): Append to an existing table in the given database. Throws an error if the table does not exist or the columns do not match (see the sketch after this list).
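
As an illustration of store/append under the proposed API (file and table names are placeholders):

import duckdb

db = duckdb.open('database.db')
part1 = duckdb.from_csv('part1.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
part1.store(db, 'test')       # creates the table 'test'
part2 = duckdb.from_csv('part2.csv', sep='\t', columns=['C1', 'C2'], types=[duckdb.int32, duckdb.int32])
part2.append(db, 'test')      # appends; errors if the schemas do not match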

Result Visualization/Conversion

  • head(): Fetch a single chunk from the result and print to screen in pretty format
  • print(): Materialize result and print to screen in pretty format
  • to_df(): Convert relation to a Pandas DataFrame

More Examples

import duckdb
# table: integers(i INTEGER)
# query: SELECT SUM(i)+2*COUNT(*) FROM integers WHERE i>10
db = duckdb.connect('test.db')
# produces a relation -> names: ['i'] types: [INTEGER]
tbl = db.table('integers')
# returns a relation with the same names/types  -> names: ['i'] types: [INTEGER]
tbl = tbl.filter('i > 10')
# aggr without groups, optional aliases can be provided
aggr = tbl.aggr(['SUM(i)', 'COUNT(*)'], names=["sum", "count"])
# projection now can only refer to the aggr columns
proj = aggr.project('sum * 2+count')
proj.print()

For lists, I would argue we should allow you to either provide a Python list, or to do the comma separation inside the string itself, i.e. the following are identical:

tbl.project(['i', 'i+2'])
tbl.project('i, i+2')

Actual Implementation

All of these operators map very naturally to our individual LogicalOperators, and by creating a logical plan we can run all our optimizers prior to execution. The composability where everything returns a Relation seems very clean and clear to me as well.

The main implementation difficulty here is, I think, parsing the expressions and binding them; however, this should not be very complicated either. The expressions can be parsed using our SQL parser, and column references can then be resolved by looking at the input relation and turned into BoundColumnReferences. Other operators/functions will be bound similarly.
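
A rough sketch of that binding step (all names here are hypothetical, not the actual internals):

class BoundColumnRef:
    def __init__(self, index, sql_type):
        self.index = index          # position in the input relation
        self.sql_type = sql_type    # resolved type

def bind(expr, relation):
    if expr.kind == 'COLUMN_REF':
        # resolve the column name against the input relation's schema
        idx = relation.names.index(expr.name)
        return BoundColumnRef(idx, relation.types[idx])
    # other expression kinds: bind children recursively, resolve functions, ...
    expr.children = [bind(child, relation) for child in expr.children]
    return expr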

Catalog lookups happen only in the db.table operator. Note here that we should keep a transaction open not just during execution, but also during construction: if db.table binds a table, another transaction should not be allowed to then delete that table, as that would produce a crash.

Functions are a concern as well, since they also live in the catalog of a database; however, we can simply resort to resolving only built-in functions here instead of performing a lookup in the DB catalog.

Updates/Deletions

Updates and deletions rely on row identifiers to figure out which rows to modify. As such, they must be bound to a base table. The following operations are legal:

# change entire table
db.table('integers').delete()
db.table('integers').update('i=NULL')
# change subset of table
db.table('integers').filter('i>10').delete()
# limit/order by can also be used
db.table('integers').filter('i>10').order('i DESC').limit(10).delete()

However, any operation that produces a new Relation (project, aggr, join, cross_product) cannot be used in conjunction with these operators.

Multiple Databases

This interface also allows easily combining tables from different databases, although some care must be taken w.r.t. transactions here, as operating on multiple databases at once also means having multiple open transactions. Modifications (update/delete/append) should always concern only a single database, however, and as such should not pose much of a problem.

xhochy (Contributor) commented Oct 9, 2019

@Mytherin That is a very close description of Ibis. There are certainly slight differences in the API, but Ibis is also a dplyr-like frontend to multiple databases that builds up an expression tree. You should definitely check it out before replicating it.

hannes (Member) commented Oct 9, 2019

Indeed, we should check whether we could extend Ibis to do what we envision.

Mytherin (Collaborator, Author) commented Oct 9, 2019

I would say we should not limit this interface to Python; we can create the interface in C++ and then create a Python layer on top of it. For the C++ interface, at least, I suggest we implement the proposed interface; for Python, we can consider building the Ibis interface on top of the proposed C++ interface.

Looking at Ibis, I can see that it has some similarities, but also a number of differences that I don't really like, e.g.:

  • table.group_by(keys).aggregate([metric]): Separating the grouping from the aggregate introduces a lot of problems from a user perspective, because group_by cannot really return a relation (at least in Pandas) but instead returns a special "grouping object". This is problematic from an interface design perspective and not so clean imo. Having the aggregate and grouping together makes this much cleaner, as the aggregate function simply returns a relation just like all the other operators.
  • table[table.bigint_col > 50]: This kind of design seems clean for simple cases but gets pretty ugly when you want to combine multiple conditions imo. A simple condition like i > 50 AND i < 100 now becomes table[(table.i > 50) & (table.i < 100)], with required brackets because of operator precedence, and you get very weird errors if you instead write the more logical table[table.i > 50 & table.i < 100] or table[table.i > 50 and table.i < 100]. Something like filter('i > 50 AND i < 100') seems cleaner and more user-friendly to me. We can use prepared statements to allow for variables, e.g. filter('i > $1 AND i < $2', 50, 100). This is also consistent with the SQL interface.
  • Having all these different functions to achieve SQL-like behavior also introduces a large, unnecessary learning curve for people who are already familiar with SQL, e.g. look at this snippet from the tutorial:
cond1 = table.bigint_col > 50
cond2 = table.int_col.between(2, 7)
table[cond1 | cond2].count()

I would argue this is cleaner and more readable:

table.filter('bigint_col > 50 OR int_col BETWEEN 2 AND 7').aggregate('COUNT(*)')
# or if we make a count(*) shortcut, which seems sensible:
table.filter('bigint_col > 50 OR int_col BETWEEN 2 AND 7').count()

I'm not entirely opposed to implementing the Ibis-like interface, as I know people in the Python community are used to it, but I think the interface proposed here is cleaner and more user-friendly, especially for people familiar with SQL. Having worked with Pandas (which has a very similar interface), I think it has a number of design flaws that lead to some very surprising behaviors. The proposed interface also maps very cleanly to the implementation of our database, which makes it easy to implement and maintain.

hannes (Member) commented Nov 4, 2019

Also interesting would be a programmatic way to construct CREATE TABLE statements.

hannes (Member) commented Apr 23, 2020

This is being worked on in the pyrel branch

hannes closed this as completed Jul 3, 2020