Skip to content

Conversation

@HuSen8891
Copy link
Contributor

@HuSen8891 HuSen8891 commented Jul 1, 2022

Which issue does this PR close?

Closes #2822

Rationale for this change

For logical plans, if upper offset is equal or greater than current's limit, replaces with an [LogicalPlan::EmptyRelation].

For this query: select * from (select * from t1 limit 2 offset 2 ) a order by a.column1 offset 2, we get
the Logical Plan before:

Limit: skip=2, fetch=None
-Sort: #a.column1 ASC NULLS LAST
--Projection: #a.column1
---Limit: skip=2, fetch=2
----Projection: #t1.column1, alias=a
-----TableScan: t1 projection=Some([column1])

And after:

Limit: skip=2, fetch=None
-Sort: #a.column1 ASC NULLS LAST
--Projection: #a.column1
---EmptyRelation

What changes are included in this PR?

add rules to eliminate multi limit-offset nodes to emptyRelation in datafusion/optimizer/src/eliminate_limit.rs

@github-actions github-actions bot added the optimizer Optimizer rules label Jul 1, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 1, 2022

Codecov Report

Merging #2823 (1616314) into master (5e26b13) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2823      +/-   ##
==========================================
- Coverage   85.25%   85.21%   -0.04%     
==========================================
  Files         274      275       +1     
  Lines       48762    48829      +67     
==========================================
+ Hits        41570    41608      +38     
- Misses       7192     7221      +29     
Impacted Files Coverage Δ
datafusion/optimizer/src/eliminate_limit.rs 100.00% <100.00%> (ø)
datafusion/data-access/src/object_store/mod.rs 0.00% <0.00%> (-100.00%) ⬇️
datafusion/data-access/src/object_store/local.rs 78.06% <0.00%> (-10.97%) ⬇️
datafusion/core/src/physical_plan/stream.rs 66.66% <0.00%> (-9.53%) ⬇️
datafusion/core/src/datasource/file_format/avro.rs 61.53% <0.00%> (-8.03%) ⬇️
datafusion/core/src/datasource/file_format/json.rs 93.75% <0.00%> (-5.13%) ⬇️
.../core/src/physical_plan/file_format/file_stream.rs 89.89% <0.00%> (-3.35%) ⬇️
datafusion/common/src/error.rs 80.00% <0.00%> (-2.28%) ⬇️
...afusion/core/src/physical_plan/file_format/json.rs 91.06% <0.00%> (-2.13%) ⬇️
...tafusion/core/src/physical_plan/file_format/csv.rs 92.22% <0.00%> (-1.56%) ⬇️
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e26b13...1616314. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented Jul 2, 2022

cc @ming535 - would you like to review this PR?

@ming535
Copy link
Contributor

ming535 commented Jul 2, 2022

@AssHero Hi thanks for your work! I looked into this PR, and have some concerns on logical plan node like Join when this optimization is used and left some comments.

@HuSen8891
Copy link
Contributor Author

@AssHero Hi thanks for your work! I looked into this PR, and have some concerns on logical plan node like Join when this optimization is used and left some comments.

@ming535 Thanks!But I didn't see any comments here, please let me know if I miss something?


fn eliminate_limit(
_optimizer: &EliminateLimit,
upper_offset: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use ancestor_skip instead of upper_offset to make it consistent with code in limit_push_down

input,
..
}) => {
// If upper's offset is equal or greater than current's limit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// If ancestor's offset/skip is equal or greater than current's fetch,
// replaces the current plan with an LogicalPlan::EmptyRelation.
// For example, in the following query, the subquery test_b should be optimized as an empty table: SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b LIMIT 2 OFFSET 5;

None => {}
}

let offset = match offset {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may want to use this let skip = skip.unwrap_or(0)

LogicalPlan::Limit(Limit {
skip: offset,
fetch: limit,
input,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we should keep use skip and fetch instead of offset and limit to keep it consistent with code in limit_push_down

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this, Thanks!

Self {}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also add some comments in top this file to clarify the intention of this optimizer rule in line 18 and line 27.

//! Optimizer rule to replace...

_ => {
// For those plans(projection/sort/..) which do not affect the output rows of sub-plans, we still use upper_offset;
// otherwise, use 0 instead.
let offset = match plan {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this case: a Join has an input node that has limit 10, and the ancestor of Join has offset 11. If the optimizer is allowed to optimize in this situation, then the result might be wrong.

Copy link
Contributor Author

@HuSen8891 HuSen8891 Jul 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For join plan, the upper_offset is set to 0, and the limit node for the join is not affected.

let offset = match plan {
LogicalPlan::Projection { .. } | LogicalPlan::Sort { .. } => upper_offset,
_ => 0, // join/agg/filter and any others...
};

@HuSen8891
Copy link
Contributor Author

Improvements according to the comments.

plan: &LogicalPlan,
_optimizer_config: &OptimizerConfig,
) -> Result<LogicalPlan> {
match plan {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

match plan {
  LogicalPlan::Limit(Limit {skip, Some(fetch), input, ..}})...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limit with only skip should be handled here.
For query: SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5;

what do you think about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5;
In the above query, the current plan is the one with only fetch (the subquery).

I thought when the current node is Limit with only skip, it should be handled the same as Projection, Sort.

Copy link
Contributor Author

@HuSen8891 HuSen8891 Jul 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ancestor LIMIT node is OFFSET 5 for query "SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5".
For this case, I think we should use 5 as ancestor's skip to optimizer inputs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limit with only skip should be handled here. For query: SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5;

what do you think about this?

I see, I misunderstood your code.

\n TableScan: test";
assert_optimized_plan_eq(&plan, expected);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, can we add more tests here to handle all combination of skip and fetch between ancestor and child?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I'll add more test cases.

@HuSen8891
Copy link
Contributor Author

Add more test cases.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @AssHero for the code and @ming535 for the review -- I think the code looks good to me.

I wonder if you have seen queries that produce 0 rows much in practice (as in I wonder how common will this optimization kick in / be useful)

// select * from (select * from xxx limit 5) a limit 2 offset 5;
match fetch {
Some(fetch) => {
if *fetch == 0 || ancestor_skip >= *fetch {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// otherwise, use NotRelevant instead.
let ancestor = match plan {
LogicalPlan::Projection { .. } | LogicalPlan::Sort { .. } => ancestor,
_ => &Ancestor::NotRelevant,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if Ancester::Unknown would make the intent clearer compared to Ancester::NotRelevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if Ancester::Unknown would make the intent clearer compared to Ancester::NotRelevant?

This is consistent with limit_push_down.

}

#[test]
fn limit_fetch_with_ancestor_limit_fetch() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 94646ac into apache:master Jul 6, 2022
@alamb
Copy link
Contributor

alamb commented Jul 6, 2022

Thanks again all!

@HuSen8891 HuSen8891 deleted the eliminate_limit branch July 11, 2022 02:00
gandronchik pushed a commit to cube-js/arrow-datafusion that referenced this pull request Aug 30, 2022
* eliminate multi limit-offset

* refine the code

* add more test cases
gandronchik pushed a commit to cube-js/arrow-datafusion that referenced this pull request Aug 31, 2022
* eliminate multi limit-offset

* refine the code

* add more test cases
gandronchik pushed a commit to cube-js/arrow-datafusion that referenced this pull request Sep 2, 2022
* eliminate multi limit-offset

* refine the code

* add more test cases
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

eliminate multi limit-offset nodes to EmptyRelation if possible

4 participants