[GLUTEN-4213][CORE] Refactoring insertion process of pre/post projection by liujiayi771 · Pull Request #4245 · apache/gluten

liujiayi771 · 2024-01-02T05:53:23Z

What changes were proposed in this pull request?

Implements #4213.
This PR introduces three kinds of rules:

PullOutProject: pulls out the pre-project at the LogicalPlan level. Currently it supports only the Velox backend, and it can reduce the number of pre-projects when the agg includes distinct.
ColumnarPullOutProject (ColumnarPullOutPostProject + ColumnarPullOutPreProject): pulls out the pre/post-project at the SparkPlan level. PullOutProject cannot handle all scenarios (e.g., an Aggregate introduced by InjectRuntimeFilter will be executed before PullOutProject, and some Expressions are generated in Strategy). The missing parts are handled completely by ColumnarPullOutProject. Some information required for the post-project is more easily obtained at the physical plan level, hence it is handled there.
GlutenPlanPullOutProject: handles the case where a Gluten transformer is constructed directly in TakeOrderedAndProjectExecTransformer.

Currently, only agg and sort have been incorporated into this framework. In the future, support for operators such as join and window that require pre/post projection will be added.
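
To make the pre-project idea concrete, here is a minimal, self-contained sketch in plain Scala. It does not use Spark or Gluten classes: Expr, Plan, SumAgg and pullOutPreProject are hypothetical names invented for illustration, while the real rules operate on Catalyst plan nodes. The sketch assumes that every aggregate argument that is not a plain attribute has to be computed by a Project placed below the aggregate.

```scala
object PreProjectSketch {
  // Toy expression and plan model; these are NOT Spark or Gluten classes.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  sealed trait Plan { def output: Seq[Attr] }
  final case class Scan(output: Seq[Attr]) extends Plan
  final case class Project(exprs: Seq[Expr], child: Plan) extends Plan {
    // A project exposes attributes as-is and aliases under their new name.
    def output: Seq[Attr] = exprs.map {
      case a: Attr => a
      case Alias(_, name) => Attr(name)
      case other => sys.error(s"unnamed expression in project list: $other")
    }
  }
  final case class SumAgg(args: Seq[Expr], child: Plan) extends Plan {
    def output: Seq[Attr] = args.indices.map(i => Attr(s"sum_$i"))
  }

  /** Moves every non-attribute aggregate argument into a pre-project below the aggregate. */
  def pullOutPreProject(agg: SumAgg): Plan = {
    // Deduplicate so that an expression used twice is projected only once.
    val complex = agg.args.filterNot(_.isInstanceOf[Attr]).distinct
    if (complex.isEmpty) {
      agg
    } else {
      val aliases: Map[Expr, Alias] =
        complex.zipWithIndex.map { case (e, i) => e -> Alias(e, s"_pre_$i") }.toMap
      // The pre-project keeps the child's columns and appends the computed ones.
      val projectList: Seq[Expr] = agg.child.output ++ complex.map(aliases)
      val newArgs: Seq[Expr] = agg.args.map {
        case a: Attr => a
        case e => Attr(aliases(e).name)
      }
      SumAgg(newArgs, Project(projectList, agg.child))
    }
  }
}
```

With this shape the aggregate only ever sees attribute references, so the project can be validated, and fall back, independently of the aggregate itself.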

Next steps:

Support join/window in this framework
Add a rule for row_constructor, which is required by the Velox backend.

How was this patch tested?

Existing CI.

github-actions · 2024-01-02T05:53:42Z

#4213

github-actions · 2024-01-02T05:53:55Z

Run Gluten Clickhouse CI

github-actions · 2024-01-02T05:56:04Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-01-02T05:59:34Z

@zhztheplayer @rui-mo Could you help review? I have already validated this modification on TPCDS. Using this framework to insert pre/post projection eliminates a significant amount of redundant code in the transformer. The previous approach required many if-else branches based on whether to insert a projection and whether it was for validation. The new approach also removes the need to construct projections based on an index.

liujiayi771 · 2024-01-02T06:04:16Z

@waitinfuture Could you help review?

github-actions · 2024-01-02T06:21:16Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-01-02T06:54:05Z

Design doc #4213 (comment)

github-actions · 2024-01-02T06:56:55Z

Run Gluten Clickhouse CI

rui-mo

Thanks for the nice refactor. In the meantime, could you also check if metrics work well? Some relevant code to handle the metrics of pre/post projection could be removed.

rui-mo · 2024-01-02T06:48:58Z

gluten-core/src/main/scala/io/glutenproject/execution/HashAggregateExecBaseTransformer.scala

      expr =>
        expr.filter match {
-          case None | Some(_: Attribute) | Some(_: Literal) =>
+          case None | Some(_: Attribute) =>


Why is Literal removed here?

There are two reasons for this:

If the filter condition is a Literal, it can only be of boolean type. Such filters are typically removed during Spark's optimization; see the EliminateAggregateFilter rule in Spark for more information.

The filter in Velox supports only FieldAccessTypedExpr, although ClickHouse may support a Literal filter.

Thanks. Makes sense.

rui-mo · 2024-01-02T07:17:36Z

It seems PR #3649 proposes a similar refactor. @ulysses-you @liujiayi771 Could you help check on that? Thanks.

liujiayi771 · 2024-01-02T10:02:20Z

The work done by these two PRs is essentially the same. The difference is that #3649 modifies the logical plan while my PR modifies the physical plan, and my PR also supports post-projection for agg. For the sort operator, both pre- and post-projection can be done in the logical plan. For agg, the pre-projection can be done in the logical plan, but the post-projection can only be done in the physical plan if the native output doesn't match the resultExpressions in Spark's output.

Initially, I also considered doing it in the logical plan to avoid impacting validation and AQE. I think we can combine both approaches, doing the parts that can be done in the logical plan in the logical rule, but for the sort in TakeOrderedAndProjectExecTransformer, it should be done only in the physical plan. @ulysses-you I didn't notice your PR before. I searched for issues related to project but did not check the pull requests. I would like to hear your opinion, as our approaches are fairly similar.

github-actions · 2024-01-02T11:16:44Z

Run Gluten Clickhouse CI

ulysses-you · 2024-01-02T11:57:02Z

I think the main goals of pulling out the pre/post project are:

make the transformer plan tree align with the native plan tree; e.g., if we have a native project then we must have a project transformer
decouple pre/post project fallback from the original operator; e.g., if we support transforming the aggregate but do not support the post project, then we should only fall back the post project
avoid evaluating an expression multiple times; e.g., for t1 join t2 on c1 + 1 = c2 executed as a shuffled hash join, we evaluate c1 + 1 more than once, once for the shuffle and once for the pre-project

One option is to pull out the pre-project on the logical side and the post-project on the columnar side.

liujiayi771 · 2024-01-02T12:17:11Z

@ulysses-you Agree with you. I can continue to move the pre-projection part into a logical rule if you'd like, or would you prefer to continue working on #3649?

ulysses-you · 2024-01-02T12:32:21Z

@liujiayi771 it's fine to go ahead in this pr, thank you

liujiayi771 · 2024-01-03T08:12:48Z

@ulysses-you I have identified an issue: if we modify the logical plan, the extendedOperatorOptimizationRules we insert are placed before DecimalAggregates. DecimalAggregates converts sum/avg(decimal attr) into sum/avg(unscaledValue(decimal attr)), but the unscaledValue cannot be seen in our rule. This results in the required pre-project not being added.

Maybe we should use ExperimentalMethods.extraOptimizations or postHocOptimizationBatches? I currently do not know how to use postHocOptimizationBatches.

ulysses-you · 2024-01-03T08:23:41Z

sparkSession.experimental.extraOptimizations = sparkSession.experimental.extraOptimizations ++ Seq(yourRule)
Does it work?

liujiayi771 · 2024-01-03T09:27:09Z

sparkSession.experimental.extraOptimizations ++

Yes, this approach will work, but do you think it is reasonable to use ExperimentalMethods? Will Spark remove this class in the future? However, there is no other way to add a rule at the end. Currently, there seems to be no mechanism in Gluten for modifying the extraOptimizations of the Spark session right after it is launched.

liujiayi771 · 2024-01-03T09:50:55Z

@ulysses-you One method I can think of is to add a rule through injectCheckRule, before the optimization step. This rule would only perform the modifications on the sparkSession. However, this approach might be considered as a hack.

case class AddExtraOptimizations(sparkSession: SparkSession) extends (LogicalPlan => Unit) {

  override def apply(plan: LogicalPlan): Unit = {
    sparkSession.experimental.extraOptimizations = sparkSession.experimental.extraOptimizations ++
      Seq(InsertPreProject)
  }
}
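
One thing to watch with this approach: assuming the check rule is invoked once per analyzed plan (which is how Spark's extended check rules are generally called), the assignment above would append the rule again for every query unless it is guarded. Below is a minimal sketch of such a guard. It is plain Scala with hypothetical stand-ins (Rule, Experimental, registerOnce) rather than Spark's ExperimentalMethods, so it illustrates the idea only and is not Gluten code.

```scala
object ExtraOptimizationsSketch {
  // Hypothetical stand-ins: `Rule` models an optimizer rule and `Experimental`
  // models a session's mutable list of extra optimizations. Neither is a Spark class.
  trait Rule { def name: String }
  case object InsertPreProject extends Rule { val name = "InsertPreProject" }

  final class Experimental {
    @volatile var extraOptimizations: Seq[Rule] = Nil
  }

  /** Appends `rule` only if it is not registered yet, so repeated calls are idempotent. */
  def registerOnce(experimental: Experimental, rule: Rule): Unit =
    experimental.synchronized {
      if (!experimental.extraOptimizations.contains(rule)) {
        experimental.extraOptimizations = experimental.extraOptimizations :+ rule
      }
    }
}
```

Calling registerOnce from the check rule's apply would keep the list at a single entry no matter how many plans pass through it.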

ulysses-you · 2024-01-03T11:18:01Z

I think it's OK; Spark generally won't remove a public interface. We can argue against it if someone creates a PR to remove it.

github-actions · 2024-01-05T07:54:55Z

Run Gluten Clickhouse CI

github-actions · 2024-01-05T09:12:05Z

Run Gluten Clickhouse CI

github-actions · 2024-01-19T11:37:58Z

Run Gluten Clickhouse CI

github-actions · 2024-01-19T11:57:38Z

Run Gluten Clickhouse CI

github-actions · 2024-01-21T16:03:25Z

Run Gluten Clickhouse CI

github-actions · 2024-01-21T16:55:10Z

Run Gluten Clickhouse CI

zhztheplayer

Thanks!

And would you like to update the PR description to add some words to summarize the newly added rules? This would help others understand the changes.

(probably including)

ColumnarPullOutProject (ColumnarPullOutPostProject + ColumnarPullOutPreProject)
GlutenPlanPullOutProject
PullOutProject

zhztheplayer · 2024-01-22T02:18:06Z

gluten-core/src/main/scala/io/glutenproject/extension/GlutenPlan.scala

+  /**
+   * Merge the results of two ValidationResult objects, including combining the reasons message for
+   * invalid ValidationResult.
+   *   - valid merge valid = valid
+   *   - invalid merge valid = invalid
+   *   - invalid merge invalid = invalid
+   */
+  def merge(first: ValidationResult, second: ValidationResult): ValidationResult = {
+    if (first.isValid && second.isValid) {
+      ok
+    } else {
+      val reasonStr = first.reason.getOrElse("") + second.reason.getOrElse("")
+      notOk(reasonStr)
+    }
+  }


I didn't find any usage of this method. Am I missing something?

Yes, we can remove it now.

zhztheplayer · 2024-01-22T02:51:33Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/TransformHintRule.scala

+              val pulledOutSortExec =
+                ColumnarPullOutProject.getPulledOutPlanLocally[SortExec](sortExec)


Needing this kind of statement means the rule AddTransformHintRule is coupled with ColumnarPullOutProject.

So is there a chance to make ColumnarPullOutProject more independent? I think our design allows a rule to rely on tags generated by AddTransformHintRule, but we'd better avoid having AddTransformHintRule depend on other rules.

ColumnarPullOutProject will pull out ProjectExec and need to verify if ProjectExec can be converted to native plan. If we want to decouple ColumnarPullOutProject from TransformHintRule, we can place it before TransformHintRule, which is also feasible. Initially, I implemented it this way, but I encountered an issue where ClickHouse's custom agg #3629 (comment) throws an exception when determining if post-project is needed. It may require ClickHouse's assistance to redesign the API for custom agg and not rely on throwing exceptions in getAttrsIndexForExtensionAggregateExpr for fallback, but instead trigger the fallback logic in doValidationInternal.

We can proceed with the modifications step by step. For now, let's place the validation of ProjectExec within the rule itself. ColumnarPullOutProject will only validate ProjectExec. I understand that there are similar codes in other places in Gluten that tag hints, and we can handle them together later.

I understand that there are similar codes in other places in Gluten that tag hints, and we can handle them together later.

Actually I think we may allow adding validation tags to the original physical plan in rules other than AddTransformHintRule as of now, just as if multiple rules are doing validation together. However I am not sure whether the code you mentioned above can be a good case so it might still be needed to optimize, I'll take a look as well.

I encountered an issue where ClickHouse's custom agg #3629 (comment) throws an exception when determining if post-project is needed.

This is interesting... So my feeling is that we may have to rethink how we handle backend-specific pre/post project creation code when doing the refactor. Say, if backend A has some specialized conditions to decide whether a project should be pulled out from a plan node, then we'd provide extensibility to have it customized?

Also, in the patch, the code of the new feature is currently spread over several places, including logical optimization, validation (transform hint), and physical optimization (the actual pulling logic). So I feel the complexity added to Gluten is a little higher than expected. Is there a chance to reduce it? At the same time, I am more worried about the coupling of the rules in this PR. Do you think we can add some new methods to the backend API to deal with the CH Agg issue you mentioned?

The part about "custom agg" may not have been expressed clearly. Currently, both CH and Velox require a "post project" process. The logic is the same: convert the output on the native side to an output consistent with Spark. What I mean is that CH throws an exception when retrieving the native output, which requires validation before pulling it out. It doesn't mean that CH's "post project" process differs from Velox's or has custom requirements. In fact, it is about the getAttrForAggregateExprs method, which retrieves the actual output of the aggregation. Based on this output, a "post-project" is constructed. Velox also has validation logic that throws exceptions. I also hope to include this logic in the doValidationInternal method like this. CH has custom aggregation requirements, and validation is also performed when retrieving the output for custom agg. For example, CustomSum only supports the Final mode. Since I am not familiar with the specific logic of the other custom aggs in CH, I cannot move the validation logic into doValidationInternal as in Velox. It may require CH's developers to redesign this part. However, this part is not essential and can be improved in future development.
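
The direction discussed here, turning an exception thrown while computing the native output into an ordinary validation failure, can be sketched as follows. This is a self-contained illustration in plain Scala under stated assumptions: Validation, nativeOutputFor and validateOutput are hypothetical names, the "custom_sum only supports Final" condition merely mirrors the CustomSum example above, and none of this is the actual Gluten or ClickHouse backend API.

```scala
import scala.util.{Failure, Success, Try}

object ValidationSketch {
  // Hypothetical model of a validation outcome; not Gluten's ValidationResult.
  final case class Validation(isValid: Boolean, reason: Option[String])
  val ok: Validation = Validation(isValid = true, reason = None)
  def notOk(reason: String): Validation = Validation(isValid = false, reason = Some(reason))

  // Stand-in for an output lookup that throws for unsupported combinations,
  // similar in spirit to the lookup described in the comment above.
  def nativeOutputFor(aggName: String, mode: String): Seq[String] = (aggName, mode) match {
    case ("custom_sum", "Final") => Seq("sum")
    case ("custom_sum", other) =>
      throw new UnsupportedOperationException(s"custom_sum does not support mode $other")
    case (name, _) => Seq(name)
  }

  /** Runs the throwing lookup during validation and reports a failure instead of propagating. */
  def validateOutput(aggName: String, mode: String): Validation =
    Try(nativeOutputFor(aggName, mode)) match {
      case Success(_) => ok
      case Failure(e) => notOk(e.getMessage)
    }
}
```

If validation absorbs the exception this way, a later pull-out step could call the lookup without needing its own try/catch or its own fallback tagging.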

gluten-ut/common/src/test/scala/org/apache/spark/sql/GlutenSQLTestsBaseTrait.scala

github-actions · 2024-01-22T04:35:18Z

Run Gluten Clickhouse CI

liujiayi771 · 2024-01-22T05:01:53Z

@JkSelf Could you also help to take a look?

liujiayi771 · 2024-01-22T05:04:24Z

gluten-core/src/main/scala/io/glutenproject/execution/SortExecTransformer.scala


-  override def outputOrdering: Seq[SortOrder] = sortOrder
+  override def outputOrdering: Seq[SortOrder] = child match {
+    case project: ProjectExecTransformer if ProjectTypeHint.isPreProject(project) =>


@JkSelf The outputOrdering issue encountered earlier is currently resolved like this.

github-actions · 2024-01-22T10:22:09Z

Run Gluten Clickhouse CI

JkSelf · 2024-01-23T03:10:41Z

gluten-core/src/main/scala/io/glutenproject/extension/PullOutProject.scala

+          true
+        case _ => false
+      }.isDefined)
+    case Sort(order, _, _) =>


@liujiayi771 We have added a sort check in the needPreProject method. However, it appears that the logic for handling sort operators is not added here.

I used to pull out the project for Sort in the LogicalPlan rule. But that way, the outputOrdering issue cannot be solved easily. So I moved this logic to the SparkPlan rule, and this does not have performance issues like agg. We can remove the Sort case here.

Could you provide more information about the outputOrdering issue you mentioned? Maybe I missed some context. Thanks.

@JkSelf You can check this discussion.

JkSelf · 2024-01-23T03:10:54Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/ColumnarPullOutProject.scala

+            // post-projection is needed.
+            true
+        }
+      case _ => false


Do we need the sort check here?

Sort is different from Agg: if it has a pre-project, it always needs a post-project, so I pull out the pre- and post-project together in ColumnarPullOutPreProject for SortExec.
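
The reason a sort with a pre-project always needs a post-project is that the pre-project widens the schema with the computed sort keys, and those extra columns must be dropped again after sorting. A minimal sketch in plain Scala follows. Expr, Plan, Sort and pullOutProjects are hypothetical toy types invented for illustration, not the SortExec or ProjectExecTransformer classes touched by this PR.

```scala
object SortProjectSketch {
  // Toy expression and plan model; these are NOT Spark or Gluten classes.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  sealed trait Plan { def output: Seq[Attr] }
  final case class Scan(output: Seq[Attr]) extends Plan
  final case class Project(exprs: Seq[Expr], child: Plan) extends Plan {
    def output: Seq[Attr] = exprs.map {
      case a: Attr => a
      case Alias(_, name) => Attr(name)
      case other => sys.error(s"unnamed expression in project list: $other")
    }
  }
  final case class Sort(order: Seq[Expr], child: Plan) extends Plan {
    // A sort does not change the schema of its child.
    def output: Seq[Attr] = child.output
  }

  /** Pulls out a pre-project for computed sort keys and a post-project that restores the schema. */
  def pullOutProjects(sort: Sort): Plan = {
    val complex = sort.order.filterNot(_.isInstanceOf[Attr]).distinct
    if (complex.isEmpty) {
      sort
    } else {
      val aliases: Map[Expr, Alias] =
        complex.zipWithIndex.map { case (e, i) => e -> Alias(e, s"_pre_$i") }.toMap
      val projectList: Seq[Expr] = sort.child.output ++ complex.map(aliases)
      val preProject = Project(projectList, sort.child)
      val newOrder: Seq[Expr] = sort.order.map {
        case a: Attr => a
        case e => Attr(aliases(e).name)
      }
      // The post-project selects only the original child output, dropping the extra keys.
      Project(sort.child.output, Sort(newOrder, preProject))
    }
  }
}
```

The result has the same output as the original sort, which matches the shape of the ProjectExecTransformer(sort.child.output, newSort) call shown in the diff under review.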

rui-mo

Thanks for your work. Added several comments.

rui-mo · 2024-01-24T03:41:25Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/TransformHintRule.scala

+              val pulledOutSortExec =
+                ColumnarPullOutProject.getPulledOutPlanLocally[SortExec](sortExec)


Initially, I implemented it this way, but I encountered an issue where ClickHouse's custom agg #3629 (comment) throws an exception when determining if post-project is needed.

@liujiayi771 Maybe we could try to call getAttrForAggregateExprs method in doValidationInternal for CH backend. With this issue solved, can we make ColumnarPullOutProject independent?

rui-mo · 2024-01-24T03:47:07Z

gluten-core/src/main/scala/io/glutenproject/extension/PullOutProject.scala

+ * This rule will insert a pre-project in the child of operators such as Aggregate, Sort, Join,
+ * etc., when they involve expressions that need to be evaluated in advance.
+ */
+case class PullOutProject(session: SparkSession)


Why is this class named PullOutProject if it aims to insert a pre-project? I also feel we are lacking some key information in the class descriptions, e.g., that PullOutProject works at the logical plan level, which cases this rule covers, and what the steps to insert a project are.

rui-mo · 2024-01-24T03:48:51Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/ColumnarPullOutProject.scala

+  }
+}
+
+object ColumnarPullOutProject extends Rule[SparkPlan] {


For the newly introduced rules, maybe we can provide more information about their functionality and usage. For this one, especially how it differs from PullOutPreProject.

rui-mo · 2024-01-24T03:49:17Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/ColumnarPullOutProject.scala

+  override def apply(plan: SparkPlan): SparkPlan = applyPullOutColumnarPreRules(plan)
+}
+
+case class ColumnarPullOutPostProject(validation: Boolean = false)


rui-mo · 2024-01-24T03:52:48Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/ColumnarPullOutProject.scala

+        child = preProject
+      )
+      newSort.copyTagsFrom(sort)
+      ProjectExecTransformer(sort.child.output, newSort).fallbackIfInvalid


Should we throw for other plans?

rui-mo · 2024-01-24T03:52:53Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/ColumnarPullOutProject.scala

+}
+
+/** This rule only used for situation that directly create GlutenPlan. */
+object GlutenPlanPullOutProject extends Rule[SparkPlan] with PullOutProjectHelper {


What does "directly create GlutenPlan" mean? Better to clarify a bit. It seems only Sort is covered in this rule; can we add the reason?

rui-mo · 2024-01-24T03:54:56Z

gluten-core/src/main/scala/io/glutenproject/execution/BasicPhysicalOperatorTransformer.scala

  override protected def withNewChildInternal(newChild: SparkPlan): ProjectExecTransformer =
    copy(child = newChild)
+
+  def fallbackIfInvalid: SparkPlan = {


If validation fails, fallback to vanilla Spark and add NotTransformable tag.

The same functionality should be covered by existing rules. Is it possible to remove the duplicate check here?

rui-mo · 2024-01-24T03:59:57Z

gluten-core/src/main/scala/io/glutenproject/extension/columnar/ColumnarPullOutProject.scala

+      .isDefined && plan.getTagValue(TAG).get.isInstanceOf[PRE_PROJECT]
+  }
+
+  def tagPostProject(plan: SparkPlan): Unit = {


What kind of project is regarded as post-project? Maybe add a clear definition here. Same for pre-project.

github-actions · 2024-03-10T01:44:37Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions · 2024-03-21T01:43:58Z

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

liujiayi771 force-pushed the pre-post-project-v2 branch from 90c92e6 to 3ee879f Compare January 2, 2024 05:55

liujiayi771 force-pushed the pre-post-project-v2 branch from 3ee879f to f89ffc0 Compare January 2, 2024 06:20

rui-mo requested a review from zzcclp January 2, 2024 06:46

liujiayi771 force-pushed the pre-post-project-v2 branch from f89ffc0 to f23ee20 Compare January 2, 2024 06:56

rui-mo reviewed Jan 2, 2024

liujiayi771 force-pushed the pre-post-project-v2 branch from f23ee20 to f556a66 Compare January 2, 2024 11:16

liujiayi771 marked this pull request as draft January 3, 2024 07:13

liujiayi771 force-pushed the pre-post-project-v2 branch from 03828ef to 00211d1 Compare January 5, 2024 09:11

liujiayi771 force-pushed the pre-post-project-v2 branch from 00211d1 to 8059246 Compare January 7, 2024 15:05

liujiayi771 added 4 commits January 19, 2024 19:37

Apply extended columnar pre rules in GlutenFormatWriterInjects

32d3039

Fix post-project Alias name

85cc329

Clickhouse use columnar rule

5cd30a3

ClickHouse not exclude vanilla subquery test cases

21d3d78

liujiayi771 force-pushed the pre-post-project-v2 branch from 1d17f36 to 6227408 Compare January 19, 2024 11:37

support fallback project seperately

f4e968f

liujiayi771 force-pushed the pre-post-project-v2 branch from 6227408 to f4e968f Compare January 19, 2024 11:57

Use SparkPlan transformer for physical pre/post-project

b4cdbea

liujiayi771 force-pushed the pre-post-project-v2 branch from c868592 to b4cdbea Compare January 21, 2024 16:54

zhztheplayer reviewed Jan 22, 2024

liujiayi771 commented Jan 22, 2024

intro project type hint

9b2130b

liujiayi771 force-pushed the pre-post-project-v2 branch from 7ca2d66 to 9b2130b Compare January 22, 2024 10:21

JkSelf reviewed Jan 23, 2024

View reviewed changes

liujiayi771 mentioned this pull request Jan 23, 2024

[GLUTEN-4213][CORE] Refactoring pull out project in SortExecTransformer #4497

Merged

zhztheplayer mentioned this pull request Jan 24, 2024

[CORE][VL][CH] Sub-tasks tracker of adding pre/post projections #4503

Open

rui-mo reviewed Jan 24, 2024

liujiayi771 mentioned this pull request Feb 3, 2024

[GLUTEN-4213][CORE] Refactoring pull out project in HashAggregateExecTransformer #4628

Merged

github-actions bot added the stale label Mar 10, 2024

github-actions bot closed this Mar 21, 2024
