[Python] Add scalar UDF #6862

Closed
Tishj opened this issue Mar 25, 2023 · 10 comments · Fixed by #7171

@Tishj
Contributor

Tishj commented Mar 25, 2023

We want to add the ability to register UDFs in the Python client.

Proposed syntax:

import duckdb

from duckdb.typing import *

def add_one(x):
    return x + 1

duckdb.register_scalar("plus_one", add_one, [BIGINT], BIGINT)
# duckdb.register_scalar("plus_one", lambda x: x + 1, [BIGINT], BIGINT)

res = duckdb.sql('select plus_one(5)')
@Tishj
Contributor Author

Tishj commented Mar 25, 2023

@gforsyth what do you think of this syntax?

@Alex-Monahan
Contributor

Here is sqlite3's syntax as another reference point! I have not used it though...
https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function
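For comparison, here is a minimal sketch of what the linked sqlite3 API looks like in use. It registers the same `add_one` function from the proposal above under the SQL name `plus_one`. Note that `create_function` takes the argument count (`narg`) explicitly and takes no type information.

```python
import sqlite3


def add_one(x):
    return x + 1


con = sqlite3.connect(":memory:")

# create_function(name, narg, func): narg is the number of arguments
# the SQL function accepts (1 here).
con.create_function("plus_one", 1, add_one)

result = con.execute("SELECT plus_one(5)").fetchone()[0]
print(result)

con.close()
```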

@Tishj
Contributor Author

Tishj commented Mar 25, 2023

Yeah, I need to look at the binding code for varargs. We support it for other scalar functions, but I'm not familiar with the internals yet, so I can't picture exactly how it would work. We can probably support varargs, though.

I also had some ideas for extra options: null handling is one, and exception handling (return NULL or rethrow?) is another.
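To make the two options concrete, here is a rough pure-Python sketch of the semantics they could have. Everything in it is hypothetical: `with_policies`, `null_handling`, and `on_exception` are illustrative names, not part of any DuckDB API, and the real behavior would live in the binding layer rather than in a Python wrapper.

```python
from functools import wraps


def with_policies(func, null_handling="default", on_exception="rethrow"):
    """Hypothetical wrapper illustrating the two proposed options.

    null_handling="default": any None argument short-circuits to None.
    null_handling="special": None is passed through to the function.
    on_exception="rethrow": errors propagate to the caller.
    on_exception="return_null": errors are swallowed and None is returned.
    """

    @wraps(func)
    def wrapper(*args):
        if null_handling == "default" and any(a is None for a in args):
            return None
        try:
            return func(*args)
        except Exception:
            if on_exception == "return_null":
                return None
            raise

    return wrapper


def inverse(x):
    return 1 // x


lenient = with_policies(inverse, on_exception="return_null")
strict = with_policies(inverse)

print(lenient(0))     # error swallowed, None returned
print(lenient(None))  # None input short-circuits to None
print(strict(2))      # normal call
```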

@cpcloud
Contributor

cpcloud commented Mar 27, 2023

The sqlite3 module's syntax is pretty nice. I'd vote for a slightly modified version of it that doesn't require passing in the number of arguments, since that can be extracted from the Python function object.
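As a sketch of how the argument count could be read off the function object, the standard library's `inspect.signature` exposes the parameters. The helper name `positional_arity` is made up for illustration.

```python
import inspect


def positional_arity(func):
    """Count the parameters of func that can be passed positionally."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    params = inspect.signature(func).parameters.values()
    return sum(1 for p in params if p.kind in positional_kinds)


def add_one(x):
    return x + 1


print(positional_arity(add_one))
print(positional_arity(lambda a, b: a + b))
```

One caveat: `inspect.signature` can raise `ValueError` for some callables implemented in C, so an explicit argument list may still be needed as a fallback.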

@Tishj
Contributor Author

Tishj commented Mar 27, 2023

@cpcloud We can definitely support varargs, in which case the arguments don't need to be provided explicitly.
I'd like to keep the arguments list as a mandatory parameter, but if varargs=True is set, you can pass an empty list [] or None instead.
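A rough sketch of that rule, with the caveat that `resolve_parameters` is a hypothetical helper and the strings stand in for the real type objects from `duckdb.typing`:

```python
def resolve_parameters(parameters, varargs=False):
    """Hypothetical check: the list is mandatory unless varargs=True."""
    if varargs:
        # With varargs, [] and None are both accepted and mean "no fixed list".
        return list(parameters or [])
    if parameters is None:
        raise ValueError("parameters is required unless varargs=True")
    return list(parameters)


fixed = resolve_parameters(["BIGINT"])
variadic_none = resolve_parameters(None, varargs=True)
variadic_empty = resolve_parameters([], varargs=True)

print(fixed, variadic_none, variadic_empty)
```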

@jcrist

jcrist commented Mar 27, 2023

In this implementation, what object types are passed to the function when called? Scalars (an int in this case)? Or is the function "vectorized", where a pyarrow/pandas/numpy thing is passed in instead? For efficiency reasons, I'd prefer the latter (probably with pyarrow things) if possible.

@cpcloud
Contributor

cpcloud commented Mar 27, 2023

> For efficiency reasons, I'd prefer the latter (probably with pyarrow things) if possible.

+1 from me on that too. I'd strongly prefer pyarrow objects, since they can easily be converted to pandas objects with a single method call.

@Tishj
Contributor Author

Tishj commented Mar 27, 2023

This first implementation will not be vectorized. The idea is that when you need a UDF, you likely don't need operations that can be done efficiently on pyarrow/pandas/numpy, as those are all database operations that could just as easily be done in SQL.
The aim here is to support arbitrary transformations.

But we could definitely add something like a vectorized_udf in the future that works with pyarrow.

@jcrist

jcrist commented Mar 27, 2023

> you likely don't need to do operations that can efficiently be done on pyarrow/pandas/numpy, as those are all database operations

I disagree with this. In Python there are many things that operate on those containers that aren't necessarily database operations. For our use case, we'd often want to hook in functionality that's already inherently vectorized but not expressible using duckdb native operations. For example, calling model.predict on a column using a scikit-learn model, or calling a function from scipy.special (or some other custom ufunc). If a user needs scalar-level calls, those can always be constructed out of vectorized/batched calls, but going the other way isn't possible.
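To illustrate the last point, a small sketch of building scalar-level calls on top of a batched interface. Plain Python lists stand in for pyarrow arrays here, and `as_batched` is a made-up helper name.

```python
def as_batched(scalar_func):
    """Lift a per-row function into one that takes whole columns."""

    def batched(*columns):
        # zip walks the columns in lockstep, yielding one row at a time.
        return [scalar_func(*row) for row in zip(*columns)]

    return batched


def add_one(x):
    return x + 1


batched_add_one = as_batched(add_one)
result = batched_add_one([1, 2, 3])
print(result)
```

The reverse adaptation has no equivalent: a scalar-only interface hands the function one value per call, so an inherently batched function never gets to see the whole column at once.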

@Tishj
Contributor Author

Tishj commented Mar 27, 2023

That makes sense, I'll adjust the focus to the vectorized pyarrow-backed version instead then 👍
