Skip to content

Conversation

@Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Aug 4, 2024

Which issue does this PR close?

Parts #11502

Rationale for this change

  • move aggregate_statistics to datafusion-physical-optimizer
  • leave the tests to datafusion/core/tests/physical_optimizer_integration.rs

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 4, 2024
@Weijun-H Weijun-H changed the title refactor: move aggregate statistics to datafusion-physical-optimizer refactor: move aggregate_statistics to datafusion-physical-optimizer Aug 4, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @Weijun-H -- I think you are doing the hard work of figuring out the patterns as we start to break out the physical optimizers from the core

In this case, I think the challenge is where to put the tests for physical-optimizer passes that rely on code that is not available to the physical-optimizer crate (like the functions library or the parquet exec)

We had a similar challenge with the logical optimizer passes

I think the two main options are:

  1. Move the tests into "integration" tests in https://github.com/apache/datafusion/tree/main/datafusion/core/tests
  2. Rewrite the tests to avoid the dependencies (e.g. use function stubs and other things

I personally suggest we move the tests to https://github.com/apache/datafusion/tree/main/datafusion/core/tests so they retain the same level of coverage

The downside of this is that then the tests will not be in the same directory as code. I think we can manage this with some good comments / documentation (e.g. at the end of the physical optimizer module saying where the tests were

cc @lewiszlw @mustafasrepo @jayzhan211

[dependencies]
datafusion-common = { workspace = true, default-features = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Why is this needed?

If it is only for

use datafusion_expr::utils::COUNT_STAR_EXPANSION;

I think we could potentially move COUNT_STAR_EXPANSION to datafusion_common (and leave a backwards compatible pub use) and avoid this dependency

I think in general we are trying to keep the division between logical and physical exprs separate when possible -- this makes it easier for systems that only need physical plans (e.g. DataFusion Comet)

@@ -15,292 +15,19 @@
// specific language governing permissions and limitations
// under the License.

//! Utilizing exact statistics from sources to avoid scanning data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that putting these tests in datafusion/core/src/physical_optimizer/tests/aggregate_statistics.rs will make them harder to find becase:

  1. It is in a different crate than being tested
  2. It is in a non standard testing location

As I understand it, these tests use code in other modules that physical optimizer doesn't depend on (like functions-aggregate

Thus, I would like to suggest we put these tests https://github.com/apache/datafusion/tree/main/datafusion/core/tests

Perhaps something like physical_optimizer_integration.rs to mirror https://github.com/apache/datafusion/blob/main/datafusion/core/tests/optimizer_integration.rs

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Aug 5, 2024

/// The value to which `COUNT(*)` is expanded to in
/// `COUNT(<constant>)` expressions
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move COUNT_STAR_EXPANSION here

Comment on lines +411 to +418
/// Describe the type of aggregate being tested
pub enum TestAggregate {
/// Testing COUNT(*) type aggregates
CountStar,

/// Testing for COUNT(column) aggregate
ColumnA(Arc<Schema>),
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enum is used by aggregate_statistics and limited_distinct_aggregate, so move it for convenience.

@Weijun-H Weijun-H requested a review from alamb August 5, 2024 08:03
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Weijun-H -- this looks great.

@alamb alamb merged commit fcd907d into apache:main Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants