Improved DPlyr-Like Python Interface #302
Comments
I think this is a great idea. I am not sure about the res = read_csv.filter('C1 > 100 AND C1 < 120').aggr("COUNT(*)", groups="C2") syntax. The reason why dplyr has the |
The Python equivalent to |
Implementation Ideas
Every node returns a Relation. The following operators seem necessary (perhaps there are some I am forgetting):
Sources
Manipulation
Modifying Data
All of the modify operations directly trigger execution and do not return a relation.
Result Visualization/Conversion
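The idea that every operator returns a new relation, with nothing executed until a result is requested, can be sketched roughly as follows. This is a minimal illustration and not DuckDB's implementation: the `Relation` class, its method names, and the `execute` helper are hypothetical, relations are represented as nested SQL text rather than logical operators, and SQLite from the Python standard library stands in for the engine.

```python
import sqlite3


def _as_sql_list(exprs):
    """Accept either a Python list of expressions or one comma-separated string."""
    return exprs if isinstance(exprs, str) else ", ".join(exprs)


class Relation:
    """Hypothetical lazy query node: holds SQL text, runs nothing until execute()."""

    def __init__(self, sql):
        self.sql = sql

    @classmethod
    def table(cls, name):
        return cls(f"SELECT * FROM {name}")

    def _derive(self, select="*", suffix=""):
        # Wrap the current relation as a subquery so operators compose freely.
        return Relation(f"SELECT {select} FROM ({self.sql}) AS r {suffix}".strip())

    def filter(self, condition):
        return self._derive(suffix=f"WHERE {condition}")

    def project(self, exprs):
        return self._derive(select=_as_sql_list(exprs))

    def aggr(self, exprs, names=None, groups=None):
        exprs = [exprs] if isinstance(exprs, str) else list(exprs)
        if names:
            exprs = [f"{e} AS {n}" for e, n in zip(exprs, names)]
        if groups:
            select = ", ".join([groups] + exprs)
            return self._derive(select=select, suffix=f"GROUP BY {groups}")
        return self._derive(select=", ".join(exprs))

    def order(self, expr):
        return self._derive(suffix=f"ORDER BY {expr}")

    def limit(self, n):
        return self._derive(suffix=f"LIMIT {int(n)}")

    def execute(self, con):
        return con.execute(self.sql).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE integers(i INTEGER)")
con.executemany("INSERT INTO integers VALUES (?)", [(5,), (11,), (12,), (20,)])

tbl = Relation.table("integers")
agg = tbl.filter("i > 10").aggr(["SUM(i)", "COUNT(*)"], names=["total", "n"])
proj = agg.project("total + 2 * n")
result = proj.execute(con)  # the only point at which a query actually runs
```

Because each method returns a fresh `Relation`, intermediate values such as `tbl` and `agg` remain usable after further operators are chained onto them.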
More Examples
import duckdb
# table: integers(i INTEGER)
# query: SELECT SUM(i)+2*COUNT(*) FROM integers WHERE i>10
db = duckdb.connect('test.db')
# produces a relation -> names: ['i'] types: [INTEGER]
tbl = db.table('integers')
# returns a relation with the same names/types -> names: ['i'] types: [INTEGER]
tbl = tbl.filter('i > 10')
# aggr without groups, optional aliases can be provided
aggr = tbl.aggr(['SUM(i)', 'COUNT(*)'], names=["sum", "count"])
# projection now can only refer to the aggr columns
proj = aggr.project('sum + 2 * count')
proj.print()
For lists, I would argue we allow you to either provide a Python list or put the comma-separated expressions inside the string itself, i.e. the following are identical:
tbl.project(['i', 'i+2'])
tbl.project('i, i+2')
Actual Implementation
All of these operators map very naturally to our individual LogicalOperators, and by creating a logical plan we can run all our optimizers prior to execution. The composability where everything returns a Relation seems very clean and clear to me as well.
The main implementation difficulty here is, I think, parsing the expressions and binding them; however, this should not be very complicated either. The expressions can be parsed using our SQL parser, and then column references can be resolved by looking at the input relation and turned into BoundColumnReferences. Other operators/functions will be similarly bound. Catalog lookups happen only in the
Functions are a concern as well, as they live in the catalog of a database; however, we can simply resort to resolving only built-in functions here instead of performing a lookup in the DB catalog.
Updates/Deletions
Updates and deletions rely on row identifiers to figure out which rows to modify. As such, they must be bound to a base table. The following operations are legal:
# change entire table
db.table('integers').delete()
db.table('integers').update('i=NULL')
# change subset of table
db.table('integers').filter('i>10').delete()
# limit/order by can also be used
db.table('integers').filter('i>10').order('i DESC').limit(10).delete()
However, any operation that produces a new Relation (
Multiple Databases
This interface also allows easily combining tables from different databases, although some care must be taken w.r.t. transactions here, as operating on multiple databases at once also means having multiple open transactions. Modifications (update/delete/append) should always concern only a single database, however, and as such should not pose much of a problem.
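To make the point about row identifiers in Updates/Deletions concrete, here is a rough sketch of how a filter/order/limit chain followed by a delete can be reduced to "select row identifiers, then delete exactly those rows". It uses SQLite's `rowid` as a stand-in for DuckDB's internal row identifiers, and `delete_where` is a hypothetical helper, not part of any proposed API.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE integers(i INTEGER)")
con.executemany("INSERT INTO integers VALUES (?)", [(v,) for v in range(1, 31)])


def delete_where(con, table, condition="1", order=None, limit=None):
    """Hypothetical helper: pick row identifiers first, then delete those rows."""
    pick = f"SELECT rowid FROM {table} WHERE {condition}"
    if order is not None:
        pick += f" ORDER BY {order}"
    if limit is not None:
        pick += f" LIMIT {int(limit)}"
    # The subquery only works because every selected row still maps back
    # to a row of the base table.
    cur = con.execute(f"DELETE FROM {table} WHERE rowid IN ({pick})")
    return cur.rowcount


# Mirrors: db.table('integers').filter('i>10').order('i DESC').limit(10).delete()
deleted = delete_where(con, "integers", "i > 10", order="i DESC", limit=10)
remaining = [row[0] for row in con.execute("SELECT i FROM integers ORDER BY i")]
```

An aggregate has no such mapping back to base-table rows, which is one way to see why a delete needs to stay bound to a base table.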
@Mytherin That is a very close description of Ibis. There are definitely slight differences in the API, but Ibis is also a dplyr-like frontend to multiple databases that builds up an expression tree. You should definitely check it out before replicating it.
Indeed, we should check whether we could extend Ibis to do what we envision.
I would say we should not limit this interface to Python; we can create the interface in C++ and then create a Python layer on top of it. At least for the C++ interface I suggest we implement the proposed interface; for Python we can consider building the Ibis interface on top of the proposed interface in C++. Looking at Ibis I can see that it has some similarities but also a bunch of differences that I don't really like, e.g.
cond1 = table.bigint_col > 50
cond2 = table.int_col.between(2, 7)
table[cond1 | cond2].count()
I would argue this is cleaner and more readable:
table.filter('bigint_col > 50 OR int_col BETWEEN 2 AND 7').aggregate('COUNT(*)')
# or if we make a count(*) shortcut, which seems sensible:
table.filter('bigint_col > 50 OR int_col BETWEEN 2 AND 7').count()
I'm not entirely opposed to implementing the Ibis-like interface, as I know people in the Python community are used to it, but I think the interface proposed here is cleaner and more user-friendly, especially for people familiar with SQL. Having worked with Pandas (which has a very similar interface), I think it has a number of design flaws that lead to very surprising behavior. The proposed interface also maps very cleanly to the implementation of our database, which makes it easy to implement and maintain.
Also interesting is a programmatic way to construct |
This is being worked on in the |
For Python, we want to implement a dplyr-like interface so users do not need to use SQL to express their queries. Instead, they can manipulate the data using operations that will be converted into DuckDB physical operators and executed using the DuckDB query engine. Operators can be chained using a dplyr-like syntax (e.g. by overloading the '>>' operator).
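As a rough illustration of chaining via an overloaded '>>' operator, the sketch below implements `__rshift__` on a small relation class. All names here (`Relation`, `table`, `filter_`, `aggr`) are hypothetical, relations are kept as SQL text purely for brevity, and SQLite is used only so the example runs with the standard library.

```python
import sqlite3


class Relation:
    """Hypothetical relation that only carries SQL text until it is executed."""

    def __init__(self, sql):
        self.sql = sql

    def __rshift__(self, operator):
        # `relation >> operator` applies the operator and yields a new Relation.
        return operator(self)


def table(name):
    return Relation(f"SELECT * FROM {name}")


def filter_(condition):
    # Trailing underscore avoids shadowing Python's built-in filter().
    return lambda rel: Relation(f"SELECT * FROM ({rel.sql}) AS r WHERE {condition}")


def aggr(expr, groups):
    return lambda rel: Relation(
        f"SELECT {groups}, {expr} FROM ({rel.sql}) AS r GROUP BY {groups}"
    )


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE csv_data(C1 INTEGER, C2 TEXT)")
con.executemany(
    "INSERT INTO csv_data VALUES (?, ?)",
    [(90, "a"), (101, "a"), (110, "a"), (119, "b"), (150, "b")],
)

pipeline = (
    table("csv_data")
    >> filter_("C1 > 100 AND C1 < 120")
    >> aggr("COUNT(*)", groups="C2")
)
rows = sorted(con.execute(pipeline.sql).fetchall())
```

Since `>>` is left-associative in Python, the pipeline reads top to bottom in the order the operators are applied.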
Expressions are still modeled as strings, as this allows for easy combination of multiple expressions. They are parsed as SQL expressions, e.g. by prefixing them with 'SELECT ', throwing them in the parser and then extracting the operations again.
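The "prefix with SELECT and hand it to the parser" trick can be tried out with any SQL engine. The sketch below uses SQLite as a stand-in: it can only tell us whether an expression string parses and binds against a table, and does not expose a parse tree the way DuckDB's own parser would. `check_expression` is a hypothetical helper.

```python
import sqlite3


def check_expression(con, expr, table):
    """Return True if `expr` parses and binds against `table`, else False."""
    try:
        # LIMIT 0 lets the statement compile without producing any rows.
        con.execute(f"SELECT {expr} FROM {table} LIMIT 0")
        return True
    except sqlite3.OperationalError:
        # Raised both for syntax errors and for unknown column names.
        return False


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE integers(i INTEGER)")

valid = check_expression(con, "i + 2", "integers")
bad_syntax = check_expression(con, "i +", "integers")
unknown_column = check_expression(con, "j + 1", "integers")
```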
Example 1: read from CSV
Example 2: use persistent storage
Example 3: update/delete