Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: shouldn't pass alias through into subquery. #4141

Merged
merged 1 commit into from
Nov 15, 2022

Conversation

jackwener
Copy link
Member

@jackwener jackwener commented Nov 8, 2022

Which issue does this PR close?

Closes #4123.

Rationale for this change

What changes are included in this PR?

This PR has fixed a bug(We can see bug in UT and test case, but no one noticed it).

Original code passed the outside Alias into Subquery inside, it's wrong.

create table foo as select 1 as a;
select a from (select a from foo group by 1 order by a) as c;

Projection: c.a
  Projection: c.a, alias=c
    Projection: c.a
      Sort: foo.a ASC NULLS LAST
        Projection: foo.a, alias=c        <------- wrong, outside pass into subquery inside
          Aggregate: groupBy=[[foo.a]], aggr=[[]]
            TableScan: foo

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core datafusion crate sql labels Nov 8, 2022
@@ -778,7 +778,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
let normalized_alias = alias.as_ref().map(|a| normalize_ident(&a.name));
let logical_plan = self.query_to_plan_with_alias(
*subquery,
normalized_alias.clone(),
None,
Copy link
Member Author

@jackwener jackwener Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix bug here, because we shouldn't pass the alias into subquery inside.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the plans carefully -- thank you @jackwener

I can't quite figure out if this could cause an error / incorrect results but it certainly looks better

I am a little worried about the new column projection in datafusion/core/tests/sql/explain_analyze.rs but maybe that is a preexisting bug that is simply uncovered by this change.

@@ -746,7 +746,7 @@ async fn test_physical_plan_display_indent_multi_children() {
" ProjectionExec: expr=[c2@0 as c2]",
" ProjectionExec: expr=[c1@0 as c2]",
" RepartitionExec: partitioning=RoundRobinBatch(9000)",
" CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1]",
" CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1, c2]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right to me.

The query is only selecting c1 from the table -- so why is this scan now selecting c1 and c2?

    let sql = "SELECT c1 \
               FROM (select c1 from aggregate_test_100) AS a \
               JOIN\
               (select c1 as c2 from aggregate_test_100) AS b \
               ON c1=c2\
               ";

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is confusing the alias c2 with the column c2, so I am concerned that this could also introduce correctness issues. It would be good to understand this better before we merge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be caused by project_pushdown rule, I will look into it.

Comment on lines -1524 to -1525
" Projection: t3.t1_id, t3.t1_name, t3.t1_int, alias=t3 [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
" Projection: t1.t1_id, t1.t1_name, t1.t1_int, alias=t3 [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sure looks better to have not have this double alias

@@ -3170,10 +3170,10 @@ mod tests {
) AS a
) AS b";
let expected = "Projection: b.fn2, b.last_name\
\n Projection: b.fn2, b.last_name, b.birth_date, alias=b\
\n Projection: a.fn1 AS fn2, a.last_name, a.birth_date, alias=b\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah looking at these old plans now this selecting a just to alias them as b is a little strange

@andygrove
Copy link
Member

I took a first pass through this, and it seems to make sense, but I was confused by some of the expected results. I'll take another look tomorrow.

@andygrove andygrove self-requested a review November 9, 2022 20:01
@jackwener
Copy link
Member Author

jackwener commented Nov 11, 2022

@alamb @andygrove
I have find the reason of projection include [c1, c2].
It's because the pushdown project rule don't handle alias. In fact, when handle projection with alias, we need replace the alias column in new_required_columns, but if we see the code of this rule, we can find that it didn't consider alias.

For example:

project c1 as c2
    tablescan c1.

when we meet project c1 as c2, if new_required_columns include c2, we need replace it with c1.

It's a future ticket, I have debug this rule, I will try to fix it.

@jackwener
Copy link
Member Author

cc @mingmwang @liukun4515

@andygrove
Copy link
Member

@jackwener Thanks for tracking down the bug and filing an issue.

@liukun4515
Copy link
Contributor

Thanks @jackwener

@liukun4515 liukun4515 merged commit d91ce06 into apache:master Nov 15, 2022
@jackwener jackwener deleted the fix_alias branch November 15, 2022 04:19
@ursabot
Copy link

ursabot commented Nov 15, 2022

Benchmark runs are scheduled for baseline = 9de893e and contender = d91ce06. d91ce06 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate sql
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in interpreting correctly parsed SQL with aliases
5 participants