Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Median to functions-aggregate and Introduce Numeric signature #10644

Merged
merged 7 commits into from
May 26, 2024

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added physical-expr Physical Expressions core Core datafusion crate labels May 24, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review May 24, 2024 13:25
@jayzhan211 jayzhan211 marked this pull request as draft May 24, 2024 13:25
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 changed the title Move Median to functions-aggregate Move Median to functions-aggregate and Introduce Numeric signature May 24, 2024
data_type,
distinct,
aliases: vec!["median".to_string()],
signature: Signature::numeric(1, Volatility::Immutable),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduce this signature that computes the final coerce types instead of generating tons of valid types with the possible same type. The numeric signature considers decimal types too, which is needed but not included in the Numeric types array.

@@ -39,6 +39,7 @@ path = "src/lib.rs"

[dependencies]
arrow = { workspace = true }
arrow-schema = { workspace = true }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for downcast_integer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the issue that not re-exported in arrow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, some macro has inevitable arrow-schema dependency apache/arrow-rs#5676.

@jayzhan211 jayzhan211 marked this pull request as ready for review May 24, 2024 14:31
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it -- thank you @jayzhan211

@@ -119,6 +119,8 @@ pub enum TypeSignature {
OneOf(Vec<TypeSignature>),
/// Specifies Signatures for array functions
ArraySignature(ArrayFunctionSignature),
/// Fixed number of arguments of numeric types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please document (or link to documentation) about what types are "numeric"? Is it https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_numeric ?

@@ -39,6 +39,7 @@ path = "src/lib.rs"

[dependencies]
arrow = { workspace = true }
arrow-schema = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the issue that not re-exported in arrow?


/// MEDIAN aggregate expression. If using the non-distinct variation, then this uses a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment still holds and would be valuable to bring to the new Median UDF

Specifically that median uses substantial memory

@@ -257,6 +257,56 @@ impl OptimizerRule for SingleDistinctToGroupBy {
)))
}
}
Expr::AggregateFunction(AggregateFunction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell this is now a general optimization (as in it will rewrite any distinct user defined aggregate as well). If so, that is quite cool. Is it correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add a test (if not already done) showing that the distinct aggregate is rewritten? Or perhaps there is already a test for median 🤔

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
logical_plan
01)Projection: MEDIAN(alias1) AS MEDIAN(DISTINCT t.c)
02)--Aggregate: groupBy=[[]], aggr=[[MEDIAN(alias1)]]
03)----Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shows distinct is optimzed

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@alamb alamb merged commit 26b44f4 into apache:main May 26, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented May 26, 2024

Thanks again @jayzhan211

jayzhan211 added a commit to jayzhan211/arrow-datafusion that referenced this pull request May 26, 2024
…pache#10644)

* introduce median udaf

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm agg median

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm old median

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* introduce numeric signature

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* address comment

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fix doc

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add proto roundtrip

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Jun 13, 2024
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Jun 14, 2024
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1

`datafusion-common` also depends on `pyo3`, so they need to be upgraded together.

* feat: remove GetIndexField

datafusion replaced Expr::GetIndexField with a FieldAccessor trait.

Ref apache/datafusion#10568
Ref apache/datafusion#10769

* feat: update ScalarFunction

The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream.

Ref apache/datafusion#10325

* feat: incorporate upstream array_slice fixes

Fixes #670

* update ExectionPlan::children impl for DatasetExec

Ref apache/datafusion#10543

* update value_interval_daytime

Ref apache/arrow-rs#5769

* update regexp_replace and regexp_match

Fixes #677

* add gil-refs feature to pyo3

This silences pyo3's deprecation warnings for its new Bounds api.

It's the 1st step of the migration, and should be removed before merge.

Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021

* fix signature for octet_length

Ref apache/datafusion#10726

* update signature for covar_samp

AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by

Ref apache/datafusion#10545
Ref apache/datafusion#10492

* convert covar_pop to expr_fn api

Ref: https://github.com/apache/datafusion/pull/10418/files

* convert median to expr_fn api

Ref apache/datafusion#10644

* convert variance sample to UDF

Ref apache/datafusion#10667

* convert first_value and last_value to UDFs

Ref apache/datafusion#10648

* checkpointing with a few todos to fix remaining compile errors

* impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano

* convert sum aggregate function to UDF

* remove unnecessary clone on double reference

* apply cargo fmt

* remove duplicate allow-dead-code annotation

* update tpch examples for new pyarrow interval

Fixes #665

* marked q11 tpch example as expected fail

Ref #730

* add default stride of None back to array_slice
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants