Allow User Defined Aggregates to return multiple values / structs #600
Comments
When I implemented today's aggregates, I initially thought about multiple return values, and then concluded that the Struct is sufficient and desirable. What I like about the struct is that it enables named fields, which imo makes the statements rather expressive. E.g.
vs
The context "min" and "max" imo helps the user read what they are extracting from the column. Would supporting structs for ScalarValues solve this nicely?
I think so @jorgecarleitao - what is also needed is some way in SQL to refer to the fields of the struct. So if the UDA could return a struct with fields, something like
select t.min, t.max from (select my_udagg(col1, col2) from my_table) as t
or something
It makes a lot of sense. Would it be OK with you if we add two issues as a replacement for this one: one for supporting structs on ScalarValues, the other for supporting access to struct fields by name in SQL? I can work on both of them over this weekend.
Sure that would be great! Would you like me to file them?
Thank you!
@jacobmarble I suspect a UDA could return a struct type now, and the support for structs in general in DataFusion is better (though far from complete). I think someone would need to try implementing such a UDF and see if it worked.
An update here is that I think DataFusion now supports this correctly, which is pretty neat. I wrote an example and I will propose adding that as a test to show how this works.
Test demonstrating the functionality works: #3425
Usecase
I want to implement a user defined aggregate function that produces more than one column (logical values).
Specifically, I am trying to implement the InfluxDB 'selector' functions first, last, min, and max as DataFusion aggregate functions.
I can't use the built-in aggregate functions in DataFusion, as selector functions aren't exactly like normal aggregate functions: they return both the actual aggregate value and a timestamp. In addition, first and last pick a row in the value column based on the value in the timestamp column.
After some investigation, I realize I can't elegantly use the built-in user defined aggregate framework in DataFusion either. As an example of what is going on here, let's take
The result of last(value) should be two columns, 1 | 3000; however, modeling this as a DataFusion aggregate does not seem to be possible at this time. Each aggregate function can return a single columnar value, but we need to return two (the .value and .time fields).
See additional detail and context on https://github.com/influxdata/influxdb_iox/issues/448#issuecomment-744601824
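To make the selector semantics concrete, here is a minimal sketch in plain Rust (standard library only, not the DataFusion API). The `Row` and `Selected` types, the function names, and the sample rows are all hypothetical and chosen only for illustration; how ties between equal timestamps or values should be resolved is not addressed.

```rust
/// One input row: a timestamp and the value observed at that time.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq)]
struct Row {
    time: i64,
    value: f64,
}

/// What a selector returns: the chosen value together with its timestamp.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Selected {
    value: f64,
    time: i64,
}

/// `last` selector: pick the row with the greatest timestamp.
/// Returns `None` when there are no rows.
fn selector_last(rows: &[Row]) -> Option<Selected> {
    rows.iter()
        .max_by_key(|r| r.time)
        .map(|r| Selected { value: r.value, time: r.time })
}

/// `min` selector: pick the row with the smallest value, keeping its timestamp.
fn selector_min(rows: &[Row]) -> Option<Selected> {
    rows.iter()
        .min_by(|a, b| a.value.total_cmp(&b.value))
        .map(|r| Selected { value: r.value, time: r.time })
}

fn main() {
    // Made-up sample data, not taken from the issue.
    let rows = [
        Row { time: 1000, value: 2.0 },
        Row { time: 2000, value: 0.5 },
        Row { time: 3000, value: 1.0 },
    ];
    println!("last = {:?}", selector_last(&rows));
    println!("min  = {:?}", selector_min(&rows));
}
```

Both functions need two inputs (value and time) and produce two outputs, which is exactly what a single-column aggregate result cannot express.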
Describe the solution you'd like
Ideally I was thinking that the UDF could produce a Struct (with named fields value and time), but the evaluate function returns a ScalarValue, and at the moment those don't have support for Structs.
I suspect that we would also need to add support in DataFusion for selecting fields from structs.
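The shape of the proposed solution could look roughly like the sketch below: an accumulator whose `evaluate` step returns a single scalar that happens to be a struct with named fields. This is plain Rust with stand-in types; the `Scalar` enum, `LastAccumulator`, and the `field` helper are hypothetical and are not DataFusion's actual `ScalarValue` or accumulator API.

```rust
/// Hypothetical stand-in for a scalar type that can also hold a struct.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(i64),
    Float64(f64),
    /// Named fields, in declaration order.
    Struct(Vec<(String, Scalar)>),
}

impl Scalar {
    /// Look up a struct field by name; `None` for non-struct scalars.
    fn field(&self, name: &str) -> Option<&Scalar> {
        match self {
            Scalar::Struct(fields) => fields
                .iter()
                .find(|(n, _)| n.as_str() == name)
                .map(|(_, v)| v),
            _ => None,
        }
    }
}

/// Accumulator for a `last` selector: remembers the (time, value) pair
/// with the greatest timestamp seen so far.
#[derive(Debug, Default)]
struct LastAccumulator {
    state: Option<(i64, f64)>,
}

impl LastAccumulator {
    /// Feed one input row.
    fn update(&mut self, time: i64, value: f64) {
        match self.state {
            // Keep the current state if it is at least as recent.
            Some((t, _)) if t >= time => {}
            _ => self.state = Some((time, value)),
        }
    }

    /// Combine with the partial state of another accumulator.
    fn merge(&mut self, other: &LastAccumulator) {
        if let Some((t, v)) = other.state {
            self.update(t, v);
        }
    }

    /// Produce the final result as one struct-valued scalar.
    fn evaluate(&self) -> Option<Scalar> {
        self.state.map(|(t, v)| {
            Scalar::Struct(vec![
                ("value".to_string(), Scalar::Float64(v)),
                ("time".to_string(), Scalar::Int64(t)),
            ])
        })
    }
}

fn main() {
    // Two partial accumulators, as if the input were split in two batches.
    let mut a = LastAccumulator::default();
    a.update(1000, 2.0);
    a.update(3000, 1.0);
    let mut b = LastAccumulator::default();
    b.update(2000, 3.0);

    a.merge(&b);
    let result = a.evaluate().expect("at least one row was seen");
    println!("value = {:?}", result.field("value"));
    println!("time  = {:?}", result.field("time"));
}
```

The `field` lookup plays the role that selecting struct fields by name would play on the SQL side.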
Additional context
Ported from original JIRA: https://issues.apache.org/jira/browse/ARROW-10945