Replies: 5 comments 10 replies
-
What you describe here are so-called scalar UDFs. They are called "scalar" because they, at least conceptually, apply to a single row of data in a single call and also return a single scalar value as a result. We have already thought about scalar Python UDFs at length. Implementing those is not immediately straightforward, due to two main issues.
There are some ideas for how we could implement Python UDFs (1, 2), but they each come with their own drawbacks.
There is some good news: we have already implemented table-valued UDFs for Python. These functions get a chunk of the input table passed as a Pandas data frame, can manipulate it in whatever way they want, and are required to return a data frame again. Again, the UDF only sees a chunk of the input at a time. Here is an example:

```python
import duckdb

conn = duckdb.connect()
query_result = conn.query("SELECT * FROM generate_series(10)")
print(query_result)
map_result = query_result.map(lambda df: df['generate_series'].add(42).to_frame())
print(map_result)
```

Here, the table-valued UDF is the lambda in the call to `map`.
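Since `map` hands the UDF a plain Pandas data frame, the function can also be exercised on its own, without DuckDB in the loop. A small sketch (the chunk contents here are made up for illustration):

```python
import pandas as pd

# The same lambda used as the table-valued UDF above; because map() passes
# plain pandas DataFrames, the function can be tested standalone.
udf = lambda df: df['generate_series'].add(42).to_frame()

# A made-up chunk, standing in for what DuckDB would pass per call.
chunk = pd.DataFrame({'generate_series': [0, 1, 2]})
result = udf(chunk)
print(result['generate_series'].tolist())  # [42, 43, 44]
```

This makes the chunked contract easy to reason about: whatever the function does to one data frame, it will do to every chunk of the input table.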
-
@hannes I think you're overthinking scalar UDFs and trying to make them performant. I would also really like scalar UDFs in DuckDB, and I don't care about performance; I just need them to work. So I hope it wouldn't be difficult to implement something simple to use and register. My use case is mainly using DuckDB for CI, unit testing, and compatibility. I think scalar UDFs would be valuable regardless of performance. Having them available but slow would be a great value add, imo.
-
Hey @hannes, is there any update on this feature request?
-
I've got a branch here that can do this:

```python
import duckdb
from duckdb.typing import *

con = duckdb.connect()

def table_func(df):
    print(df)
    return df

con.register_table_function('no_op_tablefunc', table_func)
rel2 = con.sql('select * from no_op_tablefunc((from range(10)))')
print(rel2)
```

This just utilizes the same code as `map`.
-
Scalar UDFs have been implemented in #7171
-
Currently, you can create UDFs in Python and use them via `duckdb.from_df(df).map(my_udf)`. I would like to be able to register my UDF with DuckDB such that I can call it from within SQL statements, similar to how you can register pandas DataFrames, e.g. `con.register('my_table', df)`. This is also similar to how you can register custom Python UDFs for use in pyspark. I propose something like:

The motivation for this is that I am trying to use splink for fuzzy deduplication, and in that framework you specify comparison functions using SQL statements. I want to be able to compare using

`is_nickname_of(first_name_l, first_name_r) OR is_nickname_of(first_name_r, first_name_l)`

I can do this currently using splink's pyspark backend, but it would be great to be able to do it with DuckDB as well, so I could swap between backends for exploration (duckdb) and full runs (pyspark).
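In the meantime, a predicate like this can be prototyped against the Pandas chunks that the existing `map` facility passes in. A sketch, where the nickname lookup and the column names are made-up placeholders, not splink's real data model:

```python
import pandas as pd

# Hypothetical nickname pairs; in a real pipeline this would come from a
# reference table rather than a hard-coded set.
NICKNAMES = {('robert', 'bob'), ('william', 'bill')}

def is_nickname_of(a, b):
    return (a, b) in NICKNAMES

# Row-by-row version of the SQL predicate
#   is_nickname_of(first_name_l, first_name_r) OR
#   is_nickname_of(first_name_r, first_name_l)
# applied to one Pandas chunk, as a map()-style table-valued UDF would see it.
def compare(df):
    df = df.copy()
    df['match'] = [
        is_nickname_of(l, r) or is_nickname_of(r, l)
        for l, r in zip(df['first_name_l'], df['first_name_r'])
    ]
    return df

chunk = pd.DataFrame({
    'first_name_l': ['robert', 'bob', 'alice'],
    'first_name_r': ['bob', 'robert', 'carol'],
})
print(compare(chunk)['match'].tolist())  # [True, True, False]
```

This is only a workaround, though; it does not give you the SQL-callable function the proposal asks for, it just shows the comparison logic running over chunked data frames.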