Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

select multiple columns in a single Expr #10102

Open
universalmind303 opened this issue Apr 16, 2024 · 22 comments · May be fixed by #10222
Open

select multiple columns in a single Expr #10102

universalmind303 opened this issue Apr 16, 2024 · 22 comments · May be fixed by #10222
Labels
enhancement New feature or request

Comments

@universalmind303
Copy link
Contributor

Is your feature request related to a problem or challenge?

I want to create a udf that can select multiple functions at once, such as a COLUMNS(<regex>) function.

select COLUMNS('number\d+') from my_table.

Currently this is not possible due to the fact that udfs can only ever output a single Expr.

Describe the solution you'd like

Since it would be quite a massive overhaul to refactor all of the planning and udf logic to return Vec<Expr>, I propose adding a new variant to Expr.

Expr::Columns(Vec<Column>).

This seems like the least invasive way to support selecting multiple columns in a single expr.

Describe alternatives you've considered

I'm open for alternatives, but I am not aware of any.

Additional context

Polars has this variant in their Expr

https://github.com/pola-rs/polars/blob/9fec2ecb6d4295969e1d155b386ee82db08745a1/crates/polars-plan/src/dsl/expr.rs#L72

@universalmind303 universalmind303 added the enhancement New feature or request label Apr 16, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Apr 19, 2024

It looks good to me, and we can gradually deprecate single column in the long-term

@edmondop
Copy link
Contributor

@universalmind303 are you already working on this?

@universalmind303
Copy link
Contributor Author

@universalmind303 are you already working on this?

no, feel free to go ahead and work on it if you'd like.

@edmondop
Copy link
Contributor

This would require also modifying the protobuf definitions, are we ok with it?

@alamb
Copy link
Contributor

alamb commented Apr 23, 2024

How about returning DataType::Struct or DataType::List or DataType::FixedSizeList?

@edmondop
Copy link
Contributor

My current wip has a struct since it supports different data types and field names, can a List work as well?

@alamb
Copy link
Contributor

alamb commented Apr 24, 2024

My current wip has a struct since it supports different data types and field names, can a List work as well?

No, you are right, a List can't really support multiple different types

@edmondop edmondop linked a pull request Apr 24, 2024 that will close this issue
@edmondop
Copy link
Contributor

@universalmind303 I think it would be interesting to test it, do you have an example / have you written the UDF?

@alamb
Copy link
Contributor

alamb commented Apr 24, 2024

I wonder if struct(col1, col2) or named_struct('a', col1, 'b', col2) would work for your usecase 🤔

@edmondop
Copy link
Contributor

> select struct(column1, column2) from abc;
+---------------------------------+
| struct(abc.column1,abc.column2) |
+---------------------------------+
| {c0: 1, c1: 2}                  |
| {c0: 4, c1: 5}                  |
+---------------------------------+
2 row(s) fetched. 
Elapsed 0.004 seconds.

that works, but isn't this unrelated to this change? I suppose it was working before too...

@edmondop
Copy link
Contributor

@alamb by the way, how does the struct udf works? I am not sure I understand how this select only a subset of the fields, does Datafusion drop from the resulting dataset the fields that are not in the return_type ? https://github.com/apache/datafusion/blob/main/datafusion/functions/src/core/struct.rs#L97

@alamb
Copy link
Contributor

alamb commented Apr 25, 2024

@alamb by the way, how does the struct udf works?

I think https://datafusion.apache.org/user-guide/sql/scalar_functions.html#struct has some pretty good examples

@edmondop
Copy link
Contributor

Apologies I wasn't precise in my question. The code of the struct udf doesn't explicitly drop fields but returns all of them.

How does the selection happens ?

@alamb
Copy link
Contributor

alamb commented Apr 25, 2024

How does the selection happens ?

I think you can select with something like my_struct_col['foo'] which returns the 'foo' field.

However now re-reading this ticket, I think it would be possible to create a user defined function like COLUMN by Implemening return_type_from_exprs to
🤔

@edmondop
Copy link
Contributor

How does the selection happens ?

I think you can select with something like my_struct_col['foo'] which returns the 'foo' field.

However now re-reading this ticket, I think it would be possible to create a user defined function like COLUMN by Implemening return_type_from_exprs to 🤔

What I was trying to say is that looking at that UDF isn't clear for me how the other fields are dropped. It seems that Datafusion performs an intersection between the field names returned by the return type and the ones returned by array_struct ?

@universalmind303
Copy link
Contributor Author

FWIW, #10102 seems related as multi selection could be implemented via struct and unnest.

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 8, 2024

How does the selection happens ?

I think you can select with something like my_struct_col['foo'] which returns the 'foo' field.
However now re-reading this ticket, I think it would be possible to create a user defined function like COLUMN by Implemening return_type_from_exprs to 🤔

What I was trying to say is that looking at that UDF isn't clear for me how the other fields are dropped. It seems that Datafusion performs an intersection between the field names returned by the return type and the ones returned by array_struct ?

@edmondop
By default the column name is named c0, c1 ... which defined in array_struct, so in return_type we also find the associated column with name c0, c1 ...

how the other fields are dropped

I don't think we drop any field in struct.

One of the example can be

statement ok
create table t(a1 int, a2 float) as values(1, 1.2), (2, 2.4);

query IR
select a1, a2 from t;
----
1 1.2
2 2.4

// make this works
query IR
select columns(a1, a2) from t;
----
1 1.2
2 2.4

@edmondop
Copy link
Contributor

edmondop commented May 16, 2024

I want to create a udf that can select multiple functions at once, such as a COLUMNS() function.

select COLUMNS('number\d+') from my_table.

looking at the struct UDF, it seems that it only receives the columns that are passed to function invocation and doesn't have access to other columns, i.e. in the struct udf, all columns are returned as a struct, which suggests that the engine only pass a subset of the available columns

However, in the case of COLUMNS('number\d+'), you need to have all the columns, and only return few of them from the function. In my understanding neither udf nor udaf or udwf from what I have seen. We would need maybe a fourth type of udf something like udgf which is invoked on all the columns of a logical plan?

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 16, 2024

However, in the case of COLUMNS('number\d+'), you need to have all the columns, and only return few of them from the function

I agree, we can't get all the columns by the current design of function, it is quite challenging than I thought 🤔. We expect to build up a projection plan given the syntax columns(..). The function mentioned here is similar to unnest, it is more like a table function. Therefore, I think we need a user-defined table function that we can define the returned Expr given the function we parsed in the parser.

The difference behaviour in parser (datafusion/sql) between these.

  • UDF/UDAF/UDWF: returns Expr::ScalarFunction / Expr::AggregateFunction / Expr::WindowFunction respectively
  • Unnest: returns Expr::Unnest
  • Columns(regex): returns Expr::Columns(Vec<Column>)
  • UDTF (user-defined table function): returns Expr

To make the mentioned function possible, we don;t even need to introduce Expr::Columns, but returns Vec<Expr> for sql_to_expr.

SelectItem::UnnamedExpr(expr) => {
let expr = self.sql_to_expr(expr, plan.schema(), planner_context)?;
let col = normalize_col_with_schemas_and_ambiguity_check(
expr,
&[&[plan.schema()]],
&plan.using_columns()?,
)?;
Ok(vec![col])
}

@edmondop
Copy link
Contributor

@jayzhan211 that function already returns a Vec<Expr>, or am I missing something? Do we still need to have udtf?

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 17, 2024

@jayzhan211 that function already returns a Vec, or am I missing something? Do we still need to have udtf?

sql_to_expr returns Result<Expr>, we can extend it to Result<Vec<Expr>>. Which function are you mentioning?

@jayzhan211
Copy link
Contributor

I didn't find equivalent behavior in postgres. I'm not sure should we support this kind of returns subset of columns based on column name matching

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants