Replies: 5 comments 10 replies
-
What you describe here are so-called scalar UDFs. They are called "scalar" because they, at least conceptually, apply to a single row of data in a single call and also return a single scalar value as a result. We have already thought about scalar Python UDFs at length. Implementing those is not immediately straightforward, due to two main issues.
There are some ideas for how we could implement Python UDFs (1, 2), but they each come with their own drawbacks.
There is some good news: we have already implemented table-valued UDFs for Python. These functions get a chunk of the input table passed as a Pandas data frame, can manipulate it in whatever way they want, and are required to return a data frame again. Again, the UDF only sees a chunk of the input at a time. Here is an example:

```python
import duckdb

conn = duckdb.connect()
query_result = conn.query("SELECT * FROM generate_series(10)")
print(query_result)
map_result = query_result.map(lambda df: df['generate_series'].add(42).to_frame())
print(map_result)
```

Here, the table-valued UDF is the lambda in the call to `map`.
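Since `map` hands the UDF a plain Pandas data frame, the function can also be exercised on its own, without DuckDB in the loop. A small sketch (the chunk contents here are made up for illustration):

```python
import pandas as pd

# The same lambda used as the table-valued UDF above; because map() passes
# plain pandas DataFrames, the function can be tested standalone.
udf = lambda df: df['generate_series'].add(42).to_frame()

# A made-up chunk, standing in for what DuckDB would pass per call.
chunk = pd.DataFrame({'generate_series': [0, 1, 2]})
result = udf(chunk)
print(result['generate_series'].tolist())  # [42, 43, 44]
```

This makes the chunked contract easy to reason about: whatever the function does to one data frame, it will do to every chunk of the input table.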
-
@hannes I think you're overthinking scalar UDFs and trying to make them performant. I would also really like scalar UDFs in DuckDB, and I don't care about performance; I just need them to work. So I hope it wouldn't be difficult to implement something simple to use and register. My use case is mainly using DuckDB for CI, unit testing, and compatibility. I think scalar UDFs would be valuable regardless of performance. Having them available but slow would be a great value add, imo.
-
Hey @hannes, is there any update on this feature request?
-
I've got a branch here that can do this:

```python
import duckdb
from duckdb.typing import *

con = duckdb.connect()

def table_func(df):
    print(df)
    return df

con.register_table_function('no_op_tablefunc', table_func)
rel2 = con.sql('select * from no_op_tablefunc((from range(10)))')
print(rel2)
```

This just utilizes the same code as `map`.
-
Scalar UDFs have been implemented in #7171
-
Currently, you can create UDFs in Python and use them via `duckdb.from_df(df).map(my_udf)`. I would like to be able to register my UDF with DuckDB such that I can call it from within SQL statements, similar to how you can register pandas DataFrames, e.g. `con.register('my_table', df)`. This is also similar to how you can register custom Python UDFs for use in pyspark. I propose something like:

The motivation for this is that I am trying to use splink for fuzzy deduplication, and in that framework you specify comparison functions using SQL statements. I want to be able to compare using

`is_nickname_of(first_name_l, first_name_r) OR is_nickname_of(first_name_r, first_name_l)`

I can do this currently using splink's pyspark backend, but it would be great to be able to do it with DuckDB as well, so I could swap between backends for exploration (duckdb) and full runs (pyspark).
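In the meantime, a predicate like this can be prototyped against the Pandas chunks that the existing `map` facility passes in. A sketch, where the nickname lookup and the column names are made-up placeholders, not splink's real data model:

```python
import pandas as pd

# Hypothetical nickname pairs; in a real pipeline this would come from a
# reference table rather than a hard-coded set.
NICKNAMES = {('robert', 'bob'), ('william', 'bill')}

def is_nickname_of(a, b):
    return (a, b) in NICKNAMES

# Row-by-row version of the SQL predicate
#   is_nickname_of(first_name_l, first_name_r) OR
#   is_nickname_of(first_name_r, first_name_l)
# applied to one Pandas chunk, as a map()-style table-valued UDF would see it.
def compare(df):
    df = df.copy()
    df['match'] = [
        is_nickname_of(l, r) or is_nickname_of(r, l)
        for l, r in zip(df['first_name_l'], df['first_name_r'])
    ]
    return df

chunk = pd.DataFrame({
    'first_name_l': ['robert', 'bob', 'alice'],
    'first_name_r': ['bob', 'robert', 'carol'],
})
print(compare(chunk)['match'].tolist())  # [True, True, False]
```

This is only a workaround, though; it does not give you the SQL-callable function the proposal asks for, it just shows the comparison logic running over chunked data frames.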