Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Feb 16, 2023

Which issue does this PR close?

Close #5293

Rationale for this change

resolves #5293

Basically the check added in #5132 and #5258 is overly strict in some cases

What changes are included in this PR?

  1. I moved the check for causing wrong results with Distinct into the LogicalPlanBuilder at the site where the problem is introduced (where new columns are added) rather than trying to figure it out beforehand.
  2. Add comments explaining what is happening
  3. Add a special case to handle the query in Error with query that has DISTINCT with ORDER BY and aliased select list #5293

Are these changes tested?

Yes, new unit tests as well as integration tests

Are there any user-facing changes?

Some queries now pass

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Feb 16, 2023
.distinct()
.unwrap()
// try to sort on some value not present in input to distinct
.sort(vec![col("c2").sort(true, true)])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a newly added test in #5258

I think the answer is wrong in this case - in particular you can see there are duplicate values of c1 produced.


let df_results = plan.clone().collect().await?;

#[rustfmt::skip]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a newly added test in #5258

If you look at the previous answers, it appears clearly incorrect to me -- there are duplicate values of c1 produced. I updated the test to use c1 and avoid an error as well as added a test with sorting by c2 that shows the error

vec![
Arc::new(Int32Array::from_slice([1, 10, 10, 100])),
Arc::new(Int32Array::from_slice([2, 12, 12, 120])),
Arc::new(Int32Array::from_slice([2, 3, 4, 5])),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a newly added test -- and since the values of (a, b) are distinct where (a) was, it didn't show wrong results. If you change this data, prior to this PR it does show incorrect results

I fixed the sort to be on a valid column and added a test for incorrect columns

@alamb alamb marked this pull request as ready for review February 16, 2023 17:01
@alamb
Copy link
Contributor Author

alamb commented Feb 16, 2023

cc @xiaoyong-z and @liukun4515

@alamb alamb requested review from jackwener and tustvold February 17, 2023 19:57
@alamb
Copy link
Contributor Author

alamb commented Feb 17, 2023

I would appreciate a review on this sooner rather than later because #5293 is effectively a regression for our users. Do you have time to take a look @xiaoyong-z ?

@xiaoyong-z
Copy link
Contributor

sorry @alamb I'm busy with my own things recently, and I don't think I'll have free time in the short term. I'm very sorry about this.

@alamb
Copy link
Contributor Author

alamb commented Feb 18, 2023

Thank you for letting me know @xiaoyong-z

Copy link
Contributor

@stuartcarnie stuartcarnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @alamb. Only notable edit is a println! in the sort function.

@alamb alamb merged commit 554852e into apache:main Feb 22, 2023
@alamb alamb deleted the alamb/oby_issue branch February 22, 2023 12:34
@ursabot
Copy link

ursabot commented Feb 22, 2023

Benchmark runs are scheduled for baseline = a853123 and contender = 554852e. 554852e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jiangzhx pushed a commit to jiangzhx/arrow-datafusion that referenced this pull request Feb 24, 2023
* Allow DISTINCT with ORDER BY and an aliased select list

* fix: update tests

* Update datafusion/core/tests/sqllogictests/test_files/order.slt

Co-authored-by: Stuart Carnie <stuart.carnie@gmail.com>

* Update datafusion/expr/src/logical_plan/builder.rs

Co-authored-by: Stuart Carnie <stuart.carnie@gmail.com>

* update test

---------

Co-authored-by: Stuart Carnie <stuart.carnie@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error with query that has DISTINCT with ORDER BY and aliased select list

4 participants