Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow User Defined Aggregates to return multiple values / structs #600

Closed
alamb opened this issue Jun 22, 2021 · 9 comments · Fixed by #3425
Closed

Allow User Defined Aggregates to return multiple values / structs #600

alamb opened this issue Jun 22, 2021 · 9 comments · Fixed by #3425
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 22, 2021

Usecase

I want to implement a user defined aggregate function that produces more than one column ( logical values)

Specifically I am trying to implement the InfluxDB 'selector' functions first, last, min, and max as DataFusion aggregate functions.

I can't use the built in aggregate functions in DataFusion as selector functions aren't exactly like normal aggregate functions – they return both the actual aggregate value as well as a timestamp. In addition, first and last pick a row in the value column based on the value in the timestamp column.

After some investigation, I realize I can't elegantly use the built in user defined aggregate framework in DataFusion either. As an example of what is going on here, let's take

value | time
-----+-----
3 | 1000
2 | 2000
1 | 3000

The result of last(value) should be be two columns 1 | 3000 – however, modeling this as a DataFusion aggregate does not seem to be possible at this time. Each aggregate function can return a single columnar value but we need to return 2 (the .value and .time fields).

See additional detail and context on https://github.com/influxdata/influxdb_iox/issues/448#issuecomment-744601824
Describe the solution you'd like
Ideally I was thinking that the UDF could produce a Struct (with named field value and time) but the evaluate function(code returns a ScalarValue and at the moment they don't have support for Structs

I suspect that we would also need to add support in DataFusion for selecting fields from structs

Additional context
Ported from original JIRA: https://issues.apache.org/jira/browse/ARROW-10945

@alamb alamb added the enhancement New feature or request label Jun 22, 2021
@jorgecarleitao
Copy link
Member

When I implemented todays' aggs, I Initially though about multiple return values, and then concluded that the Struct is sufficient and desirable. What I like about the struct is that it enables named fields, which imo makes the statements rather expressive. E.g.

df = df.agg(udaf.call("a").alias("a"))
df.select(df["a"]["min"], df["a"]["max"])

vs

df = df.agg(udaf.call("a").alias("a"))
df.select(df["a"][0], df["a"][1])

the context "min" and "max" imo helps the user at reading what they are extracting from the column.

Would supporting structs for ScalarValues solve this nicely?

@alamb
Copy link
Contributor Author

alamb commented Jun 22, 2021

Would supporting structs for ScalarValues solve this nicely?

I think so @jorgecarleitao - what is also needed is some way in SQL to refer to the structs

So like if the UDA could return a struct with fields min and max I would want to be able to do something like

select t.min, t.max from (select my_udagg(col1, col2) from my_table) as t

or something

@jorgecarleitao
Copy link
Member

it makes a lot of sense.

Would you be ok for you if we add two issues, one for supporting structs on ScalarValues, the other for supporting accessing struct fields by name on SQL as a replacement for this one?

I can work on both of them over this weekend.

@alamb
Copy link
Contributor Author

alamb commented Jun 22, 2021

Would you be ok for you if we add two issues, one for supporting structs on ScalarValues, the other for supporting accessing struct fields by name on SQL as a replacement for this one?

Sure that would be great! Would you like me to file them?

I can work on both of them over this weekend.

Thank you!

@jorgecarleitao
Copy link
Member

Perfect. I created #602 and #603 . 👍

@jacobmarble
Copy link

@alamb the dependent issues ( #602 #603 #119 ) have all been resolved. Does that mean that UDA can return a struct type now, or does some plumbing remain?

@alamb
Copy link
Contributor Author

alamb commented Sep 1, 2022

@jacobmarble I suspect a UDA could return a struct type now and the support for structs in general in DataFusion is better (though far from complete). I think someone would need to try implementing such a UDF and see if it worked

@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2022

An update here is that I think DataFusion now supports this correctly, which is pretty neat. I wrote an example and I will propose adding that as a test to show how this works

@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2022

Test demonstrating the functionality works: #3425

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants