[GLUTEN-4213][CORE] Refactoring pull out project in HashAggregateExecTransformer#4628

Merged
ulysses-you merged 2 commits into apache:main from liujiayi771:agg-project
Feb 5, 2024

Conversation

Contributor

@liujiayi771 liujiayi771 commented Feb 3, 2024

What changes were proposed in this pull request?

This separates #4245 into several smaller PRs. The current PR refactors the handling of pre/post-projects in HashAggregateExecBaseTransformer.

This PR introduces a new PullOutPostProject rule that adds a post-project above the aggregate, correcting the aggregate's output after native execution.

Example: SELECT sum(c1 + 1) + 1 FROM t GROUP BY c2
Before this rule:

Aggregate(keys=[c2], functions=[sum((c1 + 1))])
   +- Exchange
      +- Aggregate(keys=[c2], functions=[partial_sum((c1 + 1))])
         +- Scan [c1, c2]

After this rule:

Project [(sum((_pre_1)) + 1) AS (sum((c1 + 1)) + 1)]
   +- Aggregate(keys=[c2], functions=[sum((_pre_1))])
      +- Exchange
         +- Aggregate(keys=[c2], functions=[partial_sum(_pre_1)])
            +- Project [c2, (c1 + 1) AS _pre_1]
               +- Scan [c1, c2]
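The plan rewrite above can be sketched in miniature. The following is a hypothetical Python toy model of the pull-out idea, not Gluten's actual Scala rule: a complex argument of an aggregate function is replaced by a `_pre_N` alias that a pre-project computes, so the native aggregate only sees plain columns.

```python
def is_attribute(expr):
    """In this toy model, a plain column reference is just a string."""
    return isinstance(expr, str)

def pull_out_pre(agg_funcs):
    """Replace complex aggregate arguments with `_pre_N` aliases.

    agg_funcs: list of (function_name, argument_expression) pairs.
    Returns the rewritten aggregate functions plus the (alias, expression)
    pairs that a pre-project must compute below the aggregate.
    """
    pre_project = []
    rewritten = []
    for func, arg in agg_funcs:
        if is_attribute(arg):
            rewritten.append((func, arg))  # plain column: no alias needed
        else:
            alias = f"_pre_{len(pre_project) + 1}"
            pre_project.append((alias, arg))
            rewritten.append((func, alias))
    return rewritten, pre_project

# sum(c1 + 1) from the example: the argument (c1 + 1) is pulled out.
funcs, pre = pull_out_pre([("sum", ("+", "c1", 1))])
print(funcs)  # [('sum', '_pre_1')]
print(pre)    # [('_pre_1', ('+', 'c1', 1))]
```

The post-project direction is symmetric: expressions sitting on top of aggregate results (the outer `+ 1` in the example) stay in a Project above the aggregate so the native aggregate's raw output lines up with what a fallback Spark plan would produce.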

How was this patch tested?

Existing CI.


github-actions bot commented Feb 3, 2024

#4213


github-actions bot commented Feb 3, 2024

Run Gluten Clickhouse CI

Contributor Author

liujiayi771 commented Feb 3, 2024

Depends on #4619.
Passing the spark34 CI depends on #4629.


github-actions bot commented Feb 3, 2024

Run Gluten Clickhouse CI

@liujiayi771
Contributor Author

@ulysses-you @zhztheplayer Could you help review?

Member

PHILO-HE commented Feb 4, 2024

Looks like there is a test failure; should it be ignored? BTW, please rebase the code and resolve the conflict.

- udf/udf-group-by.sql - Scala UDF *** FAILED ***
  udf/udf-group-by.sql - Scala UDF
  Expected "[1.0	1.0]	3", but got "[0.9999999999999999	0.9999999999999999]	3" Result did not match for query #26
  SELECT corr(DISTINCT x, y), udf(corr(DISTINCT y, x)), count(*)
    FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y) (GlutenSQLQueryTestSuite.scala:803)
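For context on off-by-one-ulp mismatches like the `0.9999999999999999` vs `1.0` above (this is a general illustration, not the specific root-cause analysis, which was handled in #4629): floating-point addition is not associative, so a native engine that accumulates or merges partial aggregates in a different order than Spark can legitimately differ in the last bit. A minimal Python illustration:

```python
# Summing the same ten values in different ways gives different doubles:
# left-to-right accumulation drifts by one ulp, while math.fsum returns
# the correctly rounded sum.
import math

xs = [0.1] * 10
naive = sum(xs)        # left-to-right accumulation
exact = math.fsum(xs)  # correctly rounded summation

print(naive)  # 0.9999999999999999
print(exact)  # 1.0
```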

@liujiayi771
Contributor Author

@PHILO-HE Thanks. This failure has been fixed in #4629.


github-actions bot commented Feb 4, 2024

Run Gluten Clickhouse CI


github-actions bot commented Feb 4, 2024

Run Gluten Clickhouse CI

Member

@PHILO-HE PHILO-HE left a comment


Thanks for your efforts! Just a few comments.

* native agg to match the output of Spark, ensuring that the data output of the native agg can
* match the fallback Spark plan when a fallback occurs.
*/
object PullOutPostProject
Member


Will this rule be extended in the future to other operators, not limited to agg? The above comment makes it sound agg-only (if not, let's revise it a bit).

Contributor Author


Join and window may also need a post-project. I will fix the comments later, thanks.

.asInstanceOf[T]

private val applyLocally: PartialFunction[SparkPlan, SparkPlan] = {
case agg: BaseAggregateExec if supportedAggregate(agg) && needsPostProjection(agg) =>
Member


Do we need the check by supportedAggregate? It seems redundant as AddTransformHintRule may already cover it.

Contributor Author


If a SparkPlan matches the `case _` branch in AddTransformHintRule, it will be tagged as SUPPORT_TRANSFORM.

AddTransformHintRule().apply(transformedPlan)
}

override def applyForValidation[T <: SparkPlan](plan: T): T =
Member


Could you add some comments for this method? Thanks!

PHILO-HE previously approved these changes Feb 5, 2024
Member

@PHILO-HE PHILO-HE left a comment


Looks good! @ulysses-you, do you have any comment?


override def apply(plan: SparkPlan): SparkPlan = {
val transformedPlan = plan.transform(applyLocally)
AddTransformHintRule().apply(transformedPlan)
Contributor


No need to apply AddTransformHintRule here, as we will apply it outside.

Contributor Author


Fixed.

@ulysses-you
Contributor

/Benchmark Velox

@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_4628_time.csv log/native_master_02_04_2024_d1b29e1bc_time.csv difference percentage
q1 33.00 33.68 0.688 102.08%
q2 24.48 24.23 -0.256 98.96%
q3 38.13 37.37 -0.756 98.02%
q4 35.42 35.96 0.540 101.52%
q5 70.32 70.39 0.069 100.10%
q6 6.63 6.99 0.360 105.42%
q7 83.48 84.51 1.034 101.24%
q8 85.42 84.14 -1.278 98.50%
q9 120.89 121.28 0.397 100.33%
q10 42.99 43.47 0.483 101.12%
q11 20.07 20.46 0.396 101.97%
q12 27.66 26.23 -1.434 94.82%
q13 45.95 45.25 -0.699 98.48%
q14 16.67 15.93 -0.749 95.51%
q15 28.93 26.74 -2.196 92.41%
q16 14.53 14.11 -0.422 97.10%
q17 102.28 102.75 0.474 100.46%
q18 148.43 148.66 0.229 100.15%
q19 12.41 13.50 1.087 108.76%
q20 27.17 26.71 -0.464 98.29%
q21 222.50 221.36 -1.134 99.49%
q22 13.68 13.64 -0.041 99.70%
total 1221.04 1217.37 -3.673 99.70%


github-actions bot commented Feb 5, 2024

Run Gluten Clickhouse CI

Contributor

@ulysses-you ulysses-you left a comment


lgtm, thank you

@ulysses-you ulysses-you merged commit a1a158f into apache:main Feb 5, 2024
@liujiayi771 liujiayi771 deleted the agg-project branch February 5, 2024 12:55
@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_4628_time.csv log/native_master_02_04_2024_d1b29e1bc_time.csv difference percentage
q1 34.76 33.68 -1.073 96.91%
q2 24.47 24.23 -0.236 99.03%
q3 36.27 37.37 1.103 103.04%
q4 38.04 35.96 -2.073 94.55%
q5 70.49 70.39 -0.094 99.87%
q6 5.41 6.99 1.589 129.39%
q7 82.85 84.51 1.657 102.00%
q8 84.93 84.14 -0.789 99.07%
q9 121.00 121.28 0.281 100.23%
q10 41.65 43.47 1.820 104.37%
q11 19.86 20.46 0.603 103.04%
q12 24.02 26.23 2.207 109.19%
q13 44.86 45.25 0.383 100.85%
q14 16.96 15.93 -1.034 93.91%
q15 30.14 26.74 -3.397 88.73%
q16 14.53 14.11 -0.422 97.10%
q17 101.14 102.75 1.609 101.59%
q18 148.40 148.66 0.253 100.17%
q19 13.35 13.50 0.151 101.13%
q20 27.47 26.71 -0.758 97.24%
q21 218.95 221.36 2.411 101.10%
q22 15.04 13.64 -1.404 90.67%
total 1214.58 1217.37 2.788 100.23%

@KevinyhZou
Contributor

KevinyhZou commented Mar 5, 2024

Hi @liujiayi771, after this PR we found that the result order is reversed when a query contains both count and count distinct, as described in issue #4853. Can you take some time to look at this bug?


5 participants