Combine Equivalence and Ordering equivalence to simplify state #8006

mustafasrepo · 2023-10-31T12:33:06Z

Which issue does this PR close?

Closes #.

Rationale for this change

In the existing code base we have two different methods to keep track of equivalences in the Arc<dyn PhysicalPlan>.
Which are equivalence_properties and ordering_equivalence_properties.

As a background EquivalenceProperties keeps track of equivalent columns such as a=b=c, e=f.
OrderingEquivalenceProperties alternative orderings that table satisfies. such [a ASC, b ASC], and [c ASC, d ASC].
OrderingEquivalenceProperties keeps track of constant expressions also (e.g expression that are known to have a constant value. This can arise after filter, join, etc.).

Inherently, this information is coupled, as an example
Assume that

existing table satisfies following orderings [a ASC, b ASC] and [c ASC, d ASC].
table have equivalent columns a=e.
It is known that f is constant.

If an operator requires ordering at its input [e ASC, f ASC, b ASC]. During the analysis for whether this requirement is satisfied by existing ordering, we need to consider all orderings, equivalences, and constants at the same time.

Following naive algorithm can be followed for this analysis (Please note that algorithm in this PR is more elaborate.)

Remove constant expressions from the requirement (This converts requirement [e ASC, f ASC, b ASC] to [e ASC, b ASC])
Rewrite requirement such that it uses only representative expression for each distinct equivalent group (This converts requirement [e ASC, b ASC] to [a ASC, b ASC]).
Apply same procedures above each of the orderings inside the OrderingEquivalences (This converts ordering [a ASC, b ASC] to [a ASC, b ASC] and [c ASC, d ASC] to [c ASC, d ASC] no change ).
Check whether normalized requirement [a ASC, b ASC] is satisfied by any of normalized orderings [a ASC, b ASC], [c ASC, d ASC].

As can be seen from the example above. Keeping track of these information separately, is a bit cumbersome.

Also for the user implementing new functionality is a bit hard, and existing APIs are a bit involved also. Such as ordering_satisfy, requirements_compatible, etc.

I think it is better to unify these information in a single struct so that

We can expose better, and more friendly APIs from struct.
Move utils, functions, to method calls
Keep the invariants in the state (not relying on correct arguments).
All of the implementations, algorithms resides in a single place, and logic is not scatterred in different files.

What changes are included in this PR?

This PR unifies EquivalenceProperties and OrderingEquivalenceProperties to single struct called EquivalenceProperties (equivalence now involves, exact equivalence, ordering equivalence, constants).
And all of the implementation that depend on this information is moved to method calls (such as projection, ordering_satisfy, etc.)

Additional tests to show that better plans can be produced with new design. (As an example ordering_satisfy no longer just depends on single ordering, which is output ordering. Bu considers all of the valid orderings for the table. This enables additional optimizations. See new tests under window.slt as an example)

Are these changes tested?

Yes new tests are added.

Are there any user-facing changes?

api change ordering_equivalence_properties is removed from the ExecutionPlan and now equivalence_properties contains additional information.

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_distribution.rs # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/core/src/physical_optimizer/replace_with_order_preserving_variants.rs # datafusion/core/src/physical_optimizer/utils.rs # datafusion/physical-expr/src/expressions/get_indexed_field.rs # datafusion/physical-plan/src/aggregates/mod.rs # datafusion/physical-plan/src/joins/utils.rs

alamb · 2023-11-02T14:19:18Z

Hi @alamb. Did you get any problems testing downstream with IOx? We have some follow-ons to this in the pipeline once this merges

I have two test failures downstream that I am looking into. I don't know if they are problems with this PR or something pre-existing that it exposed

Getting this PR reviewed / tested in my highest priority today. I will have an update in a few hours

alamb · 2023-11-02T16:04:05Z

One of the test failures internally looks like the following

The input looks like:

2023-11-02T15:58:06.601675Z TRACE log: Optimized physical plan by CombinePartialFinalAggregate:
OutputRequirementExec
  SortExec: expr=[time@1 ASC NULLS LAST]
    CoalescePartitionsExec
      ProjectionExec: expr=[cpu as iox::measurement, time@0 as time, (selector_last(sum_idle,time)@1).[value] as last, (selector_last(sum_system,time)@2).[value] as last_1]
        AggregateExec: mode=FinalPartitioned, gby=[time@0 as time], aggr=[selector_last(sum_idle,time), selector_last(sum_system,time)], ordering_mode=Sorted
          SortPreservingRepartitionExec: partitioning=Hash([time@0], 16), input_partitions=16, sort_exprs=time@0 ASC NULLS LAST
            AggregateExec: mode=Partial, gby=[date_bin(10000000000, time@0, 0) as time], aggr=[selector_last(sum_idle,time), selector_last(sum_system,time)]
              RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
                SortExec: expr=[time@0 ASC NULLS LAST]
                  CoalescePartitionsExec
                    ProjectionExec: expr=[time@0 as time, SUM(cpu.usage_idle)@1 as sum_idle, SUM(cpu.usage_system)@2 as sum_system]
                      AggregateExec: mode=FinalPartitioned, gby=[time@0 as time], aggr=[SUM(cpu.usage_idle), SUM(cpu.usage_system)]
                        RepartitionExec: partitioning=Hash([time@0], 16), input_partitions=16
                          AggregateExec: mode=Partial, gby=[date_bin(10000000000, time@0, 0) as time], aggr=[SUM(cpu.usage_idle), SUM(cpu.usage_system)]
                            RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
                              ProjectionExec: expr=[time@1 as time, usage_idle@2 as usage_idle, usage_system@3 as usage_system]
                                FilterExec: date_bin(10000000000, time@1, 0) <= 1698940686290451000 AND time@1 <= 1698940686290451000 AND cpu@0 = cpu-total
                                  ParquetExec: file_groups={1 group: [[2/8/0649f0e8b1abed092a356ec6181369fcf585431d1cc0694a0cc4ab45cf78b49d/0c5ac9b2-f6d4-4004-9036-15412da47647.parquet]]}, projection=[cpu, time, usage_idle, usage_system], predicate=date_bin(10000000000, time@2, 0) <= 1698940686290451000 AND time@2 <= 1698940686290451000 AND cpu@0 = cpu-total, pruning_predicate=time_min@0 <= 1698940686290451000 AND cpu_min@1 <= cpu-total AND cpu-total <= cpu_max@2

But then after EnforceSorting the SortPreservingMergeExec seems to have to sort exprs anymore:

2023-11-02T15:58:06.605925Z TRACE log: Optimized physical plan by EnforceSorting:
OutputRequirementExec
  SortPreservingMergeExec: [time@1 ASC NULLS LAST] 
    SortExec: expr=[time@1 ASC NULLS LAST]
      ProjectionExec: expr=[cpu as iox::measurement, time@0 as time, (selector_last(sum_idle,time)@1).[value] as last, (selector_last(sum_system,time)@2).[value] as last_1]
        AggregateExec: mode=FinalPartitioned, gby=[time@0 as time], aggr=[selector_last(sum_idle,time), selector_last(sum_system,time)]
    ----> SortPreservingRepartitionExec: partitioning=Hash([time@0], 16), input_partitions=16 
            AggregateExec: mode=Partial, gby=[date_bin(10000000000, time@0, 0) as time], aggr=[selector_last(sum_idle,time), selector_last(sum_system,time)]
              RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=16
                ProjectionExec: expr=[time@0 as time, SUM(cpu.usage_idle)@1 as sum_idle, SUM(cpu.usage_system)@2 as sum_system]
                  AggregateExec: mode=FinalPartitioned, gby=[time@0 as time], aggr=[SUM(cpu.usage_idle), SUM(cpu.usage_system)]
                    RepartitionExec: partitioning=Hash([time@0], 16), input_partitions=16
                      AggregateExec: mode=Partial, gby=[date_bin(10000000000, time@0, 0) as time], aggr=[SUM(cpu.usage_idle), SUM(cpu.usage_system)]
                        RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1
                          ProjectionExec: expr=[time@1 as time, usage_idle@2 as usage_idle, usage_system@3 as usage_system]
                            FilterExec: date_bin(10000000000, time@1, 0) <= 1698940686290451000 AND time@1 <= 1698940686290451000 AND cpu@0 = cpu-total
                              ParquetExec: file_groups={1 group: [[2/8/0649f0e8b1abed092a356ec6181369fcf585431d1cc0694a0cc4ab45cf78b49d/0c5ac9b2-f6d4-4004-9036-15412da47647.parquet]]}, projection=[cpu, time, usage_idle, usage_system], predicate=date_bin(10000000000, time@2, 0) <= 1698940686290451000 AND time@2 <= 1698940686290451000 AND cpu@0 = cpu-total, pruning_predicate=time_min@0 <= 1698940686290451000 AND cpu_min@1 <= cpu-total AND cpu-total <= cpu_max@2

This then causes a failure during execution / streaming merge

Internal error: Sort expressions cannot be empty for streaming merge

alamb

Thank you again @mustafasrepo and @ozankabak -- I can't realistically review this entire PR in detail, but I did skim the whole thing and reviewed the test changes, comments, and parts about grouping.

I think there may be a bug with this implementation as I mentioned in my comment, However, given the size of this PR I think we should merge it to master and then debug / fix whatever issues we find as follow on PRs.

Thank you again -- I think this is a major improvement

alamb · 2023-11-02T16:19:38Z

datafusion/physical-expr/src/utils.rs

-/// For example two InListExpr can be considered to be equals no matter the order:
-///
-/// In('a','b','c') == In('c','b','a')
-pub fn expr_list_eq_any_order(


FWIW in this kind of change (moving functions to other modules and renaming them) is something that we could probably do as individual PRs that would be quick to review as they would be mostly mechanical.

That would help make it easier to find the parts of a PR such as this one that needed more careful review

Agreed, we sent most of the drive by cleanups/refactors we did as separate small PRs, but this must have slipped through the cracks.

alamb · 2023-11-02T16:29:23Z

datafusion/physical-plan/src/aggregates/order/mod.rs


 mod full;
 mod partial;

+use crate::windows::PartitionSearchMode;


This change effectively removes the duplication between GroupByOrderMode and PartitionSearchMode that represent the same thing right?

I find PartitionSearchMode a confusing name as the term "partition" is pretty overloaded already (like each ExecutionPlan has input/output partitions, and WindowExec deals with partitions).

Also the fact it is in the windows module seems to be a mismatch given it is now used in the Aggregation logic

Maybe it could go into its own module 🤔 and be called SortOrderMode or something

Also, can we clarify what PartitionSearchMode::PartiallySorted means -- specifically, does it represent a prefix of the columns that is sorted?

This could all be done as follow on PRs

datafusion/physical-plan/src/aggregates/row_hash.rs

datafusion/physical-plan/src/coalesce_partitions.rs

alamb · 2023-11-02T16:39:58Z

datafusion/physical-plan/src/common.rs

@@ -373,6 +375,38 @@ pub fn batch_byte_size(batch: &RecordBatch) -> usize {
    batch.get_array_memory_size()
 }

+/// Constructs the mapping between a projection's input and output
+pub fn calculate_projection_mapping(


To make this easier to discover, perhaps it could be a method in ProjectionMapping such as ProjectionMapping::try_new

For now ProjectionMapping is just a type alias, but we plan to promote it to a full-fledged type in a follow on PR -- I agree that it makes sense to convert this into a method then.

FWIW I hacked up a version here and it worked great: #8033

datafusion/physical-plan/src/joins/nested_loop_join.rs

alamb · 2023-11-02T16:43:26Z

datafusion/physical-plan/src/projection.rs

+    /// The mapping used to normalize expressions like Partitioning and
+    /// PhysicalSortExpr. The key is the expression from the input schema
+    /// and the value is the expression from the output schema.
+    projection_mapping: Vec<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)>,


This is another pattern that has appeared more than once (also in GroupBy) -- maybe it could be ProjectionMapping (which could also be made into a struct w/ a constructor)

BTW I made a PR to do this here -- #8033 and it worked out well. Once this PR is merged I'll rebase it and get it ready for review

Here is a similar one for EquivalenceClass -- #8034 that might be more of a conflict magnet so leaving it for a while

datafusion/physical-plan/src/projection.rs

ozankabak · 2023-11-02T17:15:40Z

One of the test failures internally looks like the following

...

We will investigate tomorrow and include the fix in this PR if it is a quick fix. Thanks for battle-testing :)

# Conflicts: # datafusion/physical-plan/src/lib.rs

…xpression ordering among input ordering equivalences

alamb · 2023-11-03T15:14:57Z

~~FWIW I am running our internal tests against the code on this branch and they now pass.~~ Thank you @mustafasrepo

Update: I misran the test -- I am now running again

Update: the internal test still fails

alamb · 2023-11-03T15:46:28Z

BTW I still think give the size of this PR we should merge it now and then fix the issue next week. Follow on ticket to serve as a placeholder

#8043

mustafasrepo · 2023-11-03T15:54:28Z

BTW I still think give the size of this PR we should merge it now and then fix the issue next week. Follow on ticket to serve as a placeholder

#8043

That would be great. I actually understood the reason for the bug, and working on it. However, I also think that it is better to merge this PR without waiting.

alamb · 2023-11-03T16:03:16Z

Let's merge this to keep the code flowing

mustafasrepo · 2023-11-06T12:13:12Z

Partially resolves issue #8064

* Cleanup logical optimizer rules. (apache#7919) * Initial commit * Address todos * Update comments * Simplifications * Minor simplifications * Address reviews * Add TableScan constructor * Minor changes * make try_new_with_schema method of Aggregate private * Use projection try_new instead of try_new_schema * Simplifications, add comment * Review changes * Improve comments * Move get_wider_type to type_coercion module * Clean up type coercion file --------- Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Parallelize Serialization of Columns within Parquet RowGroups (apache#7655) * merge main * fixes and cmt * review comments, tuning parameters, updating docs * cargo fmt * reduce default buffer size to 2 and update docs * feat: Use bloom filter when reading parquet to skip row groups (apache#7821) * feat: implement read bloom filter support * test: add unit test for read bloom filter * Simplify bloom filter application * test: add unit test for bloom filter with sql `in` * fix: imrpove bloom filter match express * fix: add more test for bloom filter * ci: rollback dependences * ci: merge main branch * fix: unit tests for bloom filter * ci: cargo clippy * ci: cargo clippy --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix: don't push down volatile predicates in projection (apache#7909) * fix: don't push down volatile predicates in projection * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add suggestions * fix * fix doc * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Jonah Gao <jonahgaox@gmail.com> * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Jonah Gao <jonahgaox@gmail.com> * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Jonah Gao <jonahgaox@gmail.com> * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Jonah Gao <jonahgaox@gmail.com> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Jonah Gao <jonahgaox@gmail.com> * Add `parquet` feature flag, enabled by default, and make parquet conditional (apache#7745) * Make parquet an option by adding multiple cfg attributes without significant code changes. * Extract parquet logic into submodule from execution::context * Extract parquet logic into submodule from datafusion_core::dataframe * Extract more logic into submodule from execution::context * Move tests from execution::context * Rename submodules * [MINOR]: Simplify enforce_distribution, minor changes (apache#7924) * Initial commit * Simplifications * Cleanup imports * Review --------- Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Add simple window query to sqllogictest (apache#7928) * ci: upgrade node to version 20 (apache#7918) * Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` (apache#7844) * Change input for `to_timestamp` function * docs * fix examples * output `to_timestamp` signature as ns * Minor: Document `parquet` crate feature (apache#7927) * Minor: reduce some #cfg(feature = "parquet") (apache#7929) * Minor: reduce use of cfg(parquet) in tests (apache#7930) * Fix CI failures on `to_timestamp()` calls (apache#7941) * Change input for `to_timestamp` function * docs * fix examples * output `to_timestamp` signature as ns * Fix CI `to_timestamp()` failed * Update datafusion/expr/src/built_in_function.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix typo * fix --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * minor: add a datatype casting for the updated value (apache#7922) * minor: cast the updated value to the data type of target column * Update datafusion/sqllogictest/test_files/update.slt Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * Update datafusion/sqllogictest/test_files/update.slt Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * Update datafusion/sqllogictest/test_files/update.slt Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * fix tests --------- Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * fix (apache#7946) * Add simple exclude all columns test to sqllogictest (apache#7945) * Add simple exclude all columns test to sqllogictest * Add more exclude test cases * Support Partitioning Data by Dictionary Encoded String Array Types (apache#7896) * support dictionary encoded string columns for partition cols * remove debug prints * cargo fmt * generic dictionary cast and dict encoded test * updates from review * force retry checks * try checks again * Minor: Remove array() in array_expression (apache#7961) * remove array Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup others Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clippy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup cast Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup cast Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Minor: simplify update code (apache#7943) * Add some initial content about creating logical plans (apache#7952) * Minor: Change from `&mut SessionContext` to `&SessionContext` in substrait (apache#7965) * Lower &mut SessionContext in substrait * rm mut ctx in tests * Fix crate READMEs (apache#7964) * Minor: Improve `HashJoinExec` documentation (apache#7953) * Minor: Improve `HashJoinExec` documentation * Apply suggestions from code review Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> --------- Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> * chore: clean useless clone baesd on clippy (apache#7973) * Add README.md to `core`, `execution` and `physical-plan` crates (apache#7970) * Add README.md to `core`, `execution` and `physical-plan` crates * prettier * Update datafusion/physical-plan/README.md * Update datafusion/wasmtest/README.md --------- Co-authored-by: Daniël Heres <danielheres@gmail.com> * Move source repartitioning into `ExecutionPlan::repartition` (apache#7936) * Move source repartitioning into ExecutionPlan::repartition * cleanup * update test * update test * refine docs * fix merge * minor: fix broken links in README.md (apache#7986) * minor: fix broken links in README.md * fix proto link * Minor: Upate the `sqllogictest` crate README (apache#7971) * Minor: Upate the sqllogictest crate README * prettier * Apply suggestions from code review Co-authored-by: Jonah Gao <jonahgaox@gmail.com> Co-authored-by: jakevin <jakevingoo@gmail.com> --------- Co-authored-by: Jonah Gao <jonahgaox@gmail.com> Co-authored-by: jakevin <jakevingoo@gmail.com> * Improve MemoryCatalogProvider default impl block placement (apache#7975) * Fix `ScalarValue` handling of NULL values for ListArray (apache#7969) * Fix try_from_array data type for NULL value in ListArray * Fix * Explicitly assert the datatype * For review * Refactor of Ordering and Prunability Traversals and States (apache#7985) * simplify ExprOrdering * Comment improvements * Move map/transform comment up --------- Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Keep output as scalar for scalar function if all inputs are scalar (apache#7967) * Keep output as scalar for scalar function if all inputs are scalar * Add end-to-end tests * Fix crate READMEs for core, execution, physical-plan (apache#7990) * Update sqlparser requirement from 0.38.0 to 0.39.0 (apache#7983) * chore: Update sqlparser requirement from 0.38.0 to 0.39.0 * support FILTER Aggregates * Fix panic in multiple distinct aggregates by fixing `ScalarValue::new_list` (apache#7989) * Fix panic in multiple distinct aggregates by fixing ScalarValue::new_list * Update datafusion/common/src/scalar.rs Co-authored-by: Daniël Heres <danielheres@gmail.com> --------- Co-authored-by: Daniël Heres <danielheres@gmail.com> * MemoryReservation exposes MemoryConsumer (apache#8000) ... as a getter method. * fix: generate logical plan for `UPDATE SET FROM` statement (apache#7984) * Create temporary files for reading or writing (apache#8005) * Create temporary files for reading or writing * nit * addr comment --------- Co-authored-by: zhongjingxiong <zhongjingxiong@bytedance.com> * doc: minor fix to SortExec::with_fetch comment (apache#8011) * Fix: dataframe_subquery example Optimizer rule `common_sub_expression_eliminate` failed (apache#8016) * Fix: Optimizer rule 'common_sub_expression_eliminate' failed * nit * nit * nit --------- Co-authored-by: zhongjingxiong <zhongjingxiong@bytedance.com> * Percent Decode URL Paths (apache#8009) (apache#8012) * Treat ListingTableUrl as URL-encoded (apache#8009) * Update lockfile * Review feedback * Minor: Extract common deps into workspace (apache#7982) * Improve datafusion-* * More common crates * Extract async-trait * Extract more * Fix cli --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * minor: change some plan_err to exec_err (apache#7996) * minor: change some plan_err to exec_err Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * change unreachable code to internal error Signed-off-by: Ruihang Xia <waynestxia@gmail.com> --------- Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * Minor: error on unsupported RESPECT NULLs syntax (apache#7998) * Minor: error on unsupported RESPECT NULLs syntax * fix clippy * Update datafusion/sql/tests/sql_integration.rs Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> --------- Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> * GroupedHashAggregateStream breaks spill batch (apache#8004) ... into smaller chunks to decrease memory required for merging. * Minor: Add implementation examples to ExecutionPlan::execute (apache#8013) * Add implementation examples to ExecutionPlan::execute * Review feedback * address comment (apache#7993) Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * GroupedHashAggregateStream should register spillable consumer (apache#8002) * fix: single_distinct_aggretation_to_group_by fail (apache#7997) * fix: single_distinct_aggretation_to_group_by faile * fix * move test to groupby.slt * Read only enough bytes to infer Arrow IPC file schema via stream (apache#7962) * Read only enough bytes to infer Arrow IPC file schema via stream * Error checking for collect bytes func * Update datafusion/core/src/datasource/file_format/arrow.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Minor: remove a strange char (apache#8030) * Minor: Improve documentation for Filter Pushdown (apache#8023) * Minor: Improve documentation for Fulter Pushdown * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: jakevin <jakevingoo@gmail.com> * Apply suggestions from code review * Update datafusion/optimizer/src/push_down_filter.rs Co-authored-by: Alex Huang <huangweijun1001@gmail.com> --------- Co-authored-by: jakevin <jakevingoo@gmail.com> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * Minor: Improve `ExecutionPlan` documentation (apache#8019) * Minor: Improve `ExecutionPlan` documentation * Add link to Partitioning * fix: clippy warnings from nightly rust 1.75 (apache#8025) Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * Minor: Avoid recomputing compute_array_ndims in align_array_dimensions (apache#7963) * Refactor align_array_dimensions Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * address comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * remove unwrap Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * address comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix rebase Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Minor: fix doc check (apache#8037) * Minor: remove uncessary #cfg test (apache#8036) * Minor: remove uncessary #cfg test * fmt * Update datafusion/core/src/datasource/file_format/arrow.rs Co-authored-by: Ruihang Xia <waynestxia@gmail.com> --------- Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: Ruihang Xia <waynestxia@gmail.com> * Minor: Improve documentation for `PartitionStream` and `StreamingTableExec` (apache#8035) * Minor: Improve documentation for `PartitionStream` and `StreamingTableExec` * fmt * fmt * Combine Equivalence and Ordering equivalence to simplify state (apache#8006) * combine equivalence and ordering equivalence * Remove EquivalenceProperties struct * Minor changes * all tests pass * Refactor oeq * Simplifications * Resolve linter errors * Minor changes * Minor changes * Add new tests * Simplifications window mode selection * Simplifications * Use set_satisfy api * Use utils for aggregate * Minor changes * Minor changes * Minor changes * All tests pass * Simplifications * Simplifications * Minor changes * Simplifications * All tests pass, fix bug * Remove unnecessary code * Simplifications * Minor changes * Simplifications * Move oeq join to methods * Simplifications * Remove redundant code * Minor changes * Minor changes * Simplifications * Simplifications * Simplifications * Move window to util from method, simplifications * Simplifications * Propagate meet in the union * Simplifications * Minor changes, rename * Address berkay reviews * Simplifications * Add new buggy test * Add data test for sort requirement * Add experimental check * Add random test * Minor changes * Random test gives error * Fix missing test case * Minor changes * Minor changes * Simplifications * Minor changes * Add new test case * Minor changes * Address reviews * Minor changes * Increase coverage of random tests * Remove redundant code * Simplifications * Simplifications * Refactor on tests * Solving clippy errors * prune_lex improvements * Fix failing tests * Update get_finer and get_meet * Fix window lex ordering implementation * Buggy state * Do not use output ordering in the aggregate * Add union test * Update comment * Fix bug, when batch_size is small * Review Part 1 * Review Part 2 * Change union meet implementation * Update comments * Remove redundant check * Simplify project out_expr function * Remove Option<Vec<_>> API. * Do not use project_out_expr * Simplifications * Review Part 3 * Review Part 4 * Review Part 5 * Review Part 6 * Review Part 7 * Review Part 8 * Update comments * Add new unit tests, simplifications * Resolve linter errors * Simplify test codes * Review Part 9 * Add unit tests for remove_redundant entries * Simplifications * Review Part 10 * Fix test * Add new test case, fix implementation * Review Part 11 * Review Part 12 * Update comments * Review Part 13 * Review Part 14 * Review Part 15 * Review Part 16 * Review Part 17 * Review Part 18 * Review Part 19 * Review Part 20 * Review Part 21 * Review Part 22 * Review Part 23 * Review Part 24 * Do not construct idx and sort_expr unnecessarily, Update comments, Union meet single entry * Review Part 25 * Review Part 26 * Name Changes, comment updates * Review Part 27 * Add issue links * Address reviews * Fix failing test * Update comments * SortPreservingMerge, SortPreservingRepartition only preserves given expression ordering among input ordering equivalences --------- Co-authored-by: metesynnada <100111937+metesynnada@users.noreply.github.com> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Encapsulate `ProjectionMapping` as a struct (apache#8033) * Minor: Fix bugs in docs for `to_timestamp`, `to_timestamp_seconds`, ... (apache#8040) * Minor: Fix bugs in docs for `to_timestamp`, `to_timestamp_seconds`, etc * prettier * Update docs/source/user-guide/sql/scalar_functions.md Co-authored-by: comphead <comphead@users.noreply.github.com> * Update docs/source/user-guide/sql/scalar_functions.md Co-authored-by: comphead <comphead@users.noreply.github.com> --------- Co-authored-by: comphead <comphead@users.noreply.github.com> * Improve comments for `PartitionSearchMode` struct (apache#8047) * Improve comments * Make comments partition/group agnostic * General approach for Array replace (apache#8050) * checkpoint Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * optimize non-list Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * replace list ver Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Minor: Remove the irrelevant note from the Expression API doc (apache#8053) * Minor: Add more documentation about Partitioning (apache#8022) * Minor: Add more documentation about Partitioning * fix typo * Apply suggestions from code review Co-authored-by: comphead <comphead@users.noreply.github.com> * Add more diagrams, improve text * undo unintended changes * undo unintended changes * fix links * Try and clarify --------- Co-authored-by: comphead <comphead@users.noreply.github.com> * Minor: improve documentation for IsNotNull, DISTINCT, etc (apache#8052) * Minor: improve documentation for IsNotNull, DISTINCT, etc * fix * Prepare 33.0.0 Release (apache#8057) * changelog * update version * update changelog * Minor: improve error message by adding types to message (apache#8065) * Minor: improve error message * add test * Minor: Remove redundant BuiltinScalarFunction::supports_zero_argument() (apache#8059) * deprecate BuiltinScalarFunction::supports_zero_argument() * unify old supports_zero_argument() impl * Add example to ci (apache#8060) * feat: add example to ci * nit * addr comments --------- Co-authored-by: zhongjingxiong <zhongjingxiong@bytedance.com> * Update substrait requirement from 0.18.0 to 0.19.0 (apache#8076) Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](substrait-io/substrait-rs@v0.18.0...v0.19.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix incorrect results in COUNT(*) queries with LIMIT (apache#8049) Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> * feat: Support determining extensions from names like `foo.parquet.snappy` as well as `foo.parquet` (apache#7972) * feat: read files based on the file extention * fix: some the file extension might be started with . and some not * fix: rename extention to extension * chore: use exec_err * chore: rename extention to extension * chore: rename extention to extension * chore: simplify the code * fix: check table is empty * ci: fix test * fix: add err info * refactor: extract the logic to infer_types * fix: add tests for different extensions * fix: ci clippy * fix: add more tests * fix: simplify the logic * fix: ci * Use FairSpillPool for TaskContext with spillable config (apache#8072) * Minor: Improve HashJoinStream docstrings (apache#8070) * Minor: Improve HashJoinStream docstrings * fix comments * Update datafusion/physical-plan/src/joins/hash_join.rs Co-authored-by: comphead <comphead@users.noreply.github.com> * Update datafusion/physical-plan/src/joins/hash_join.rs Co-authored-by: comphead <comphead@users.noreply.github.com> --------- Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: comphead <comphead@users.noreply.github.com> * Fixing broken link (apache#8085) * Fixing broken link * Update docs/source/contributor-guide/index.md Thanks for spotting this as well Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> --------- Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> * fix: DataFusion suggests invalid functions (apache#8083) * fix: DataFusion suggests invalid functions * update test * Add test for BuiltInWindowFunction * Replace macro with function for `array_repeat` (apache#8071) * General array repeat Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * done Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * remove test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fm Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Minor: remove unnecessary projection in `single_distinct_to_group_by` rule (apache#8061) * Minor: remove unnecessary projection * fix ci * minor: Remove duplicate version numbers for arrow, object_store, and parquet dependencies (apache#8095) * remove duplicate version numbers for arrow, object_store, and parquet dependencies * cargo update * use default features in parquet crate * disable default parquet features in wasmtest * fix: add match encode/decode scalar function type (apache#8089) * feat: Protobuf serde for Json file sink (apache#8062) * Protobuf serde for Json file sink * Fix tests * Fix test * Minor: use `Expr::alias` in a few places to make the code more concise (apache#8097) * Minor: Cleanup BuiltinScalarFunction::return_type() (apache#8088) * Expose metrics from FileSinkExec impl of ExecutionPlan --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Signed-off-by: Ruihang Xia <waynestxia@gmail.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Mustafa Akur <106137913+mustafasrepo@users.noreply.github.com> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> Co-authored-by: Devin D'Angelo <devinjdangelo@gmail.com> Co-authored-by: Hengfei Yang <hengfei.yang@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Huaijin <haohuaijin@gmail.com> Co-authored-by: Jonah Gao <jonahgaox@gmail.com> Co-authored-by: Chih Wang <ongchi@users.noreply.github.com> Co-authored-by: Jeffrey <22608443+Jefffrey@users.noreply.github.com> Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: comphead <comphead@users.noreply.github.com> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Jay Zhan <jayzhan211@gmail.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: jakevin <jakevingoo@gmail.com> Co-authored-by: 张林伟 <lewiszlw520@gmail.com> Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com> Co-authored-by: Marko Milenković <milenkovicm@users.noreply.github.com> Co-authored-by: jokercurry <982458633@qq.com> Co-authored-by: zhongjingxiong <zhongjingxiong@bytedance.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Co-authored-by: Ruihang Xia <waynestxia@gmail.com> Co-authored-by: metesynnada <100111937+metesynnada@users.noreply.github.com> Co-authored-by: Yongting You <2010youy01@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Mark Sirek <mwsirek@gmail.com> Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> Co-authored-by: Edmondo Porcu <edmondo.porcu@gmail.com> Co-authored-by: Syleechan <38198463+Syleechan@users.noreply.github.com> Co-authored-by: Dan Harris <dan@coralogix.com>

alamb · 2023-11-10T12:24:26Z

While working on the upgrade in IOx, I also filed #8120 to maybe help avoid issues for others upgrading

mustafasrepo added 30 commits September 22, 2023 13:27

combine equivalence and ordering equivalence

44ee85c

Remove EquivalenceProperties struct

56b92e1

Minor changes

620dcce

all tests pass

944029b

Refactor oeq

98f491b

Simplifications

f2a4600

Resolve linter errors

346e64e

Minor changes

9ca970b

Minor changes

722c2c2

Add new tests

f3f76d6

Simplifications window mode selection

9707b35

Simplifications

c4d7c99

Use set_satisfy api

aa7de00

Use utils for aggregate

6dbc85b

Minor changes

c1656c7

Minor changes

d77a6e0

Minor changes

fb08cce

All tests pass

8ce33be

Simplifications

e1b1488

Simplifications

aa019b3

Minor changes

055dc91

Simplifications

2f63ac7

All tests pass, fix bug

e383e46

Remove unnecessary code

85890c1

Simplifications

9c703bd

Minor changes

93b7dc2

Simplifications

79b72b9

Move oeq join to methods

c70ac05

Simplifications

f6f9d47

alamb approved these changes Nov 2, 2023

View reviewed changes

This was referenced Nov 2, 2023

Encapsulate ProjectionMapping as a struct #8033

Merged

Encapsulate EquivalenceClass into a struct #8034

Merged

mustafasrepo and others added 6 commits November 3, 2023 08:53

Merge branch 'apache_main' into refactor/oeq_eq_properties

2e52d40

# Conflicts: # datafusion/physical-plan/src/lib.rs

Address reviews

25f698d

Fix failing test

4de5540

Merge branch 'main' into refactor/oeq_eq_properties

f10c5c1

Update comments

b582cda

SortPreservingMerge, SortPreservingRepartition only preserves given e…

c7f0206

…xpression ordering among input ordering equivalences

alamb mentioned this pull request Nov 3, 2023

Internal error with repartitioning after equivalence consolidation #8043

Closed

alamb merged commit c2e7680 into apache:main Nov 3, 2023
22 checks passed

alamb mentioned this pull request Nov 3, 2023

Minor: Update quarterly_roadmap.md #8044

Closed

ozankabak mentioned this pull request Nov 4, 2023

Improve comments for PartitionSearchMode struct #8047

Merged

alamb mentioned this pull request Nov 6, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 6, 2023 #8063

Closed

6 tasks

mustafasrepo mentioned this pull request Nov 6, 2023

Improve Sort Based Optimizations #8064

Open

7 tasks

alamb mentioned this pull request Nov 7, 2023

[EPIC] Unify Function Interface (remove BuiltInScalarFunction) #8045

Closed

22 tasks

alamb mentioned this pull request Nov 10, 2023

Overlapping ExecutionPlan::maintains_input_order(), equivalence_properties and maintains_input_order are confusing #8120

Open

alamb mentioned this pull request Nov 13, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 13, 2023 #8151

Closed

9 tasks

alamb mentioned this pull request Nov 29, 2023

Move PartitionSearchMode into datafusion_physical_plan, rename to InputOrderMode #8364

Merged

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combine Equivalence and Ordering equivalence to simplify state #8006

Combine Equivalence and Ordering equivalence to simplify state #8006

mustafasrepo commented Oct 31, 2023

alamb commented Nov 2, 2023

alamb commented Nov 2, 2023 •

edited

alamb left a comment

alamb Nov 2, 2023

ozankabak Nov 2, 2023

alamb Nov 2, 2023

alamb Nov 2, 2023

ozankabak Nov 2, 2023

alamb Nov 2, 2023

alamb Nov 2, 2023

alamb Nov 2, 2023

alamb Nov 2, 2023

ozankabak commented Nov 2, 2023

alamb commented Nov 3, 2023 •

edited

alamb commented Nov 3, 2023 •

edited

mustafasrepo commented Nov 3, 2023

alamb commented Nov 3, 2023

mustafasrepo commented Nov 6, 2023

alamb commented Nov 10, 2023

Combine Equivalence and Ordering equivalence to simplify state #8006

Combine Equivalence and Ordering equivalence to simplify state #8006

Conversation

mustafasrepo commented Oct 31, 2023

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb commented Nov 2, 2023

alamb commented Nov 2, 2023 • edited

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ozankabak commented Nov 2, 2023

alamb commented Nov 3, 2023 • edited

alamb commented Nov 3, 2023 • edited

mustafasrepo commented Nov 3, 2023

alamb commented Nov 3, 2023

mustafasrepo commented Nov 6, 2023

alamb commented Nov 10, 2023

alamb commented Nov 2, 2023 •

edited

alamb commented Nov 3, 2023 •

edited

alamb commented Nov 3, 2023 •

edited