[CALCITE-6214] Remove DISTINCT in aggregate function if field is unique #3641

Merged

@JiajunBernoulli (Contributor) commented Jan 21, 2024

import java.util.List;

/**
* Planner rule that removes a distinct in count for {@link Aggregate}.
Contributor

@julianhyde asked in JIRA to handle other aggregates too. Why just count?

Contributor Author

I overlooked it before.

Now we can handle other aggregates.

/** Test case for
* <a href="https://issues.apache.org/jira/browse/CALCITE-6214">[CALCITE-6214]
* Remove `DISTINCT` in `COUNT` if field is unique</a>. */
@Test void testAggregateDistinctRemove3() {
Contributor

If you handle other aggregates you should probably add tests like SELECT COUNT(x), SUM(x) FROM SELECT DISTINCT ...

Contributor Author

Yes, after the new commits we can handle other aggregate functions as well.

@@ -6562,6 +6562,67 @@ private HepProgram getTransitiveProgram() {
.check();
}

/** Test case for
* <a href="https://issues.apache.org/jira/browse/CALCITE-6214">[CALCITE-6214]
* Remove `DISTINCT` in `COUNT` if field is unique</a>. */
Contributor

Don't use backticks in jira summary or javadoc. They will be rendered as backticks.

Contributor Author

Removed them.

* <a href="https://issues.apache.org/jira/browse/CALCITE-6214">[CALCITE-6214]
* Remove `DISTINCT` in `COUNT` if field is unique</a>. */
@Test void testAggregateDistinctRemove2() {
final String sql = ""
Contributor

Can you add at least one test where the outer query has a GROUP BY? The following query should benefit from the simplification:

SELECT deptno, COUNT(DISTINCT sal)
FROM (
   SELECT DISTINCT deptno, sal
   FROM emp)
GROUP BY deptno

Note that sal is not distinct but it is distinct for each deptno (because (deptno, sal) is a key).
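
As a rough illustration of the point above, the following sketch emulates the suggested query in plain Java. It is not Calcite code: the class name, the helper methods, and the sample rows are all made up for this example. It shows that once (deptno, sal) is a key, COUNT(DISTINCT sal) and COUNT(sal) agree in every deptno group, even though the value 1000 appears in more than one department.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only; these names are hypothetical and not part of Calcite. */
final class DistinctPerGroupSketch {
  /** Made-up (deptno, sal) rows; sal 1000 occurs in both departments. */
  static List<List<Integer>> sampleEmp() {
    return Arrays.asList(
        Arrays.asList(10, 1000), Arrays.asList(10, 1000), Arrays.asList(10, 2000),
        Arrays.asList(20, 1000), Arrays.asList(20, 3000));
  }

  /** Emulates SELECT DISTINCT deptno, sal: afterwards (deptno, sal) is a key. */
  static Collection<List<Integer>> selectDistinct(Collection<List<Integer>> rows) {
    return new LinkedHashSet<>(rows);
  }

  /** Emulates COUNT(sal) or COUNT(DISTINCT sal) with GROUP BY deptno. */
  static Map<Integer, Integer> countSalPerDept(
      Collection<List<Integer>> rows, boolean distinct) {
    Map<Integer, Collection<Integer>> salsByDept = new TreeMap<>();
    for (List<Integer> row : rows) {
      Collection<Integer> sals = salsByDept.get(row.get(0));
      if (sals == null) {
        // A set drops duplicate salaries; a list keeps every row.
        sals = distinct ? new HashSet<Integer>() : new ArrayList<Integer>();
        salsByDept.put(row.get(0), sals);
      }
      sals.add(row.get(1));
    }
    Map<Integer, Integer> counts = new TreeMap<>();
    for (Map.Entry<Integer, Collection<Integer>> e : salsByDept.entrySet()) {
      counts.put(e.getKey(), e.getValue().size());
    }
    return counts;
  }

  /** True when dropping DISTINCT does not change any per-group count. */
  static boolean distinctIsRedundant(Collection<List<Integer>> rows) {
    return countSalPerDept(rows, true).equals(countSalPerDept(rows, false));
  }

  public static void main(String[] args) {
    System.out.println(distinctIsRedundant(selectDistinct(sampleEmp())));
    System.out.println(distinctIsRedundant(sampleEmp()));
  }
}
```

With these sample rows, main prints true for the deduplicated input and false for the raw rows, where department 10 contains the salary 1000 twice.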

Contributor Author

Added it.

@JiajunBernoulli changed the title from "[CALCITE-6214] Remove DISTINCT in COUNT if field is unique" to "[CALCITE-6214] Remove DISTINCT in aggregate function if field is unique" on Jan 22, 2024
@JiajunBernoulli force-pushed the remove-distinct-if-uniq branch 3 times, most recently from 9a9584e to a4419bb, on January 28, 2024 09:50
@mihaibudiu (Contributor) left a comment

Another question I have is whether this should be as you wrote it in RelBuilder, or it should be a rewrite rule in the optimizer. It is a matter of design choice rather than of correctness.

@@ -2525,6 +2529,29 @@ private RelBuilder aggregate_(GroupKeyImpl groupKey,
return project(projects.transform((i, name) -> aliasMaybe(field(i), name)));
}

/**
* Removed redundant distinct if an input is already unique.
Contributor

I would document that this is specifically for aggregates.
And perhaps a better function name would be removeRedundantAggregateDistinct.

Contributor Author

Renamed to removeRedundantAggregateDistinct.

/** Whether to save the distinct if we know that the input is
* already unique; default true. */
@Value.Default
default boolean redundantDistinct() {
Contributor

This flag is a little unintuitive, since it inhibits the optimization rather than enabling it.
All the other similar flags work the opposite way.

Contributor Author

Good catch.

I changed the default value: setting the flag to true now enables the optimization (the default is false).

]]>
</Resource>
</TestCase>
<TestCase name="testRemoveDistinctIfUnique">
Contributor

Where is the corresponding test?

Contributor Author

Forgot to delete.

Removed it.

</Resource>
<Resource name="plan">
<![CDATA[
LogicalAggregate(group=[{}], EXPR$0=[COUNT($0)])
Contributor

Why is the DISTINCT removed from this case?

Contributor Author

We can tell that empno is a primary key by using RelMetadataQuery#areColumnsUnique.

Here is the metadata:

empTable.addColumn("EMPNO", fixture.intType, true);
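
A minimal sketch of the kind of check described in this reply, under stated assumptions: it does not call Calcite's RelMetadataQuery, and the class, the methods, and the rule "the group columns plus the argument columns must contain a known key" are a simplified model written for this example. Column ordinals (EMPNO as column 0) are also assumed.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Illustration only; a toy stand-in for a uniqueness check, not Calcite's metadata API. */
final class RedundantDistinctCheckSketch {
  /** Builds a column set from column ordinals. */
  static BitSet cols(int... ordinals) {
    BitSet bits = new BitSet();
    for (int ordinal : ordinals) {
      bits.set(ordinal);
    }
    return bits;
  }

  /** A column set is unique if it contains every column of some known key. */
  static boolean areColumnsUnique(List<BitSet> knownKeys, BitSet columns) {
    for (BitSet key : knownKeys) {
      BitSet missing = (BitSet) key.clone();
      missing.andNot(columns); // key columns that are absent from `columns`
      if (missing.isEmpty()) {
        return true;
      }
    }
    return false;
  }

  /** DISTINCT is redundant when group columns plus argument columns are unique. */
  static boolean isDistinctRedundant(List<BitSet> knownKeys, BitSet groupSet, BitSet args) {
    BitSet combined = (BitSet) groupSet.clone();
    combined.or(args);
    return areColumnsUnique(knownKeys, combined);
  }

  public static void main(String[] args) {
    // Assumption: EMPNO is column 0 and is declared as a key.
    List<BitSet> empKeys = Arrays.asList(cols(0));
    System.out.println(isDistinctRedundant(empKeys, cols(), cols(0)));

    // Assumption: the input is SELECT DISTINCT deptno, sal, so {0, 1} is a key.
    List<BitSet> distinctKeys = Arrays.asList(cols(0, 1));
    System.out.println(isDistinctRedundant(distinctKeys, cols(0), cols(1)));
    System.out.println(isDistinctRedundant(distinctKeys, cols(), cols(1)));
  }
}
```

In this model, COUNT(DISTINCT empno) with no GROUP BY qualifies because column 0 alone is a key, while COUNT(DISTINCT sal) over the deduplicated input qualifies only when deptno is part of the group key.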

Contributor

There are so many empno references in the codebase that I couldn't figure out that this is a primary key.
There are also multiple definitions of this column in multiple files, and I didn't know which one is being used here.
Can you please add a comment explaining this in the code?

Contributor Author

Added

@JiajunBernoulli
Contributor Author

> Another question I have is whether this should be as you wrote it in RelBuilder, or it should be a rewrite rule in the optimizer. It is a matter of design choice rather than of correctness.

Thank you for your review.

  1. RelBuilder optimization can reuse subexpressions.
  • Using RelBuilder
LogicalProject(DEPTNO=[$0], CDS=[$1], CS=[$2], SDS=[$3], SS=[$3]) -- SDS is same as SS
  LogicalAggregate(group=[{0}], CDS=[COUNT($1)], CS=[COUNT()], SDS=[SUM($1)])
    LogicalAggregate(group=[{0, 1}])
      LogicalProject(DEPTNO=[$7], SAL=[$5])
        LogicalTableScan(table=[[CATALOG, SALES, EMP]])
  • Using Rule
LogicalProject(DEPTNO=[$0], CDS=[$1], CS=[$2], SDS=[$3], SS=[$4]) -- SDS is same as SS
  LogicalAggregate(group=[{0}], CDS=[COUNT($1)], CS=[COUNT()], SDS=[SUM($1)], SS=[SUM($1)])
    LogicalAggregate(group=[{0, 1}])
      LogicalProject(DEPTNO=[$7], SAL=[$5])
        LogicalTableScan(table=[[CATALOG, SALES, EMP]])

With a rule, we would need additional rules to remove the duplicated aggregate call.

  2. RelBuilder is easier to use than Rule.
  • RelBuilder: withRedundantDistinct(flag) to enable or disable.
  • Rule: Add or remove rule in programs.
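
The subexpression reuse in point 1 can be modeled with a small sketch. This is not how RelBuilder is implemented; the class and method names are hypothetical, aggregate calls are represented as plain strings, and the DISTINCT forms of the calls are an assumption about the query behind the plans above. The idea shown is that once DISTINCT is dropped, SUM(DISTINCT $1) and SUM($1) become the same call and can share one output field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only; a string-based model, not Calcite's RelBuilder. */
final class AggCallDedupSketch {
  /** Assumed original calls for CDS, CS, SDS and SS. */
  static List<String> sampleCalls() {
    return Arrays.asList("COUNT(DISTINCT $1)", "COUNT()", "SUM(DISTINCT $1)", "SUM($1)");
  }

  /** Drops the DISTINCT keyword, which is valid once the input is known to be unique. */
  static List<String> removeDistinct(List<String> calls) {
    List<String> result = new ArrayList<>();
    for (String call : calls) {
      result.add(call.replace("DISTINCT ", ""));
    }
    return result;
  }

  /** Maps each call to an output field, reusing the field of an identical earlier call. */
  static List<Integer> outputFields(List<String> calls, int groupCount) {
    Map<String, Integer> fieldOfCall = new LinkedHashMap<>();
    List<Integer> fields = new ArrayList<>();
    for (String call : calls) {
      Integer field = fieldOfCall.get(call);
      if (field == null) {
        // First occurrence: allocate the next field after the group columns.
        field = groupCount + fieldOfCall.size();
        fieldOfCall.put(call, field);
      }
      fields.add(field);
    }
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(outputFields(removeDistinct(sampleCalls()), 1));
    System.out.println(outputFields(sampleCalls(), 1));
  }
}
```

With one group column, the first line printed is [1, 2, 3, 3], which corresponds to the $1, $2, $3, $3 references in the RelBuilder plan. If DISTINCT is kept, the four calls stay different and the result is [1, 2, 3, 4].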

sonarcloud bot commented Feb 11, 2024

@JiajunBernoulli merged commit ec0dc3c into apache:main on Feb 12, 2024
17 checks passed
3 participants