Skip to content

[GLUTEN-8557][CH] Flatten nested And/Or for performance optimization#8558

Merged
taiyang-li merged 1 commit intoapache:mainfrom
KevinyhZou:opt_nested_funcs
Apr 3, 2025
Merged

[GLUTEN-8557][CH] Flatten nested And/Or for performance optimization#8558
taiyang-li merged 1 commit intoapache:mainfrom
KevinyhZou:opt_nested_funcs

Conversation

@KevinyhZou
Copy link
Contributor

@KevinyhZou KevinyhZou commented Jan 17, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #8557)

How was this patch tested?

test by ut

@KevinyhZou KevinyhZou marked this pull request as draft January 17, 2025 09:11
@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Jan 17, 2025
@github-actions
Copy link

#8557

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField Jan 17, 2025
@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField Jan 17, 2025
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject Jan 21, 2025
@KevinyhZou KevinyhZou closed this Jan 21, 2025
@KevinyhZou KevinyhZou reopened this Jan 21, 2025
@KevinyhZou KevinyhZou marked this pull request as ready for review January 21, 2025 11:06
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

5 similar comments
@github-actions
Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Feb 5, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou
Copy link
Contributor Author

性能测试

性能测试:测试数据量6千万行,分别执行三次,查看端到端执行时间

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
优化前:53.194s, 52.18s, 51.603s
优化后:24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
优化前: 3.918s,4.106s,3.835s
优化后: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
优化前:2.611s,2.765s, 2.484s
优化后:2.3s, 2.409s, 2.283s;

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
优化前:80.819s, 80.517s,83.3s
优化后:80.885s, 80.108s, 81.121s;

.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("get_struct_field,get_json_object");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about and, or

throw new GlutenNotSupportException("UDF name is not found!")
}
val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collapsedFunctionsMap和udf有什么关系?看起来逻辑上没必要耦合在一块,最好能解耦

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.{DataType, DataTypes}

case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add necessary comment

}

private def canBeOptimized(plan: SparkPlan): Boolean = plan match {
case p: ProjectExecTransformer =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can expression in generate operator be optimized ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems can not


static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; }

bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const
Copy link
Contributor

@taiyang-li taiyang-li Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgbo-ustc pls review changes related to get_json_object

{
const auto & args = substrait_func.arguments();
if (args.size() != 2)
if (args.size() < 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_json_object(get_json_object(d, '$.a'), '$.b') => optimize to get_json_object(d, '$.a', '$.b'), which may have more than 2 arguments.

@lgbo-ustc
Copy link
Contributor

lgbo-ustc commented Feb 12, 2025

解释下get_json_object的优化思路,不是太明白为何在这里要改成多个路径参数。
已经有 #8305 优化的情况下,这个改动带来哪些变化?

@lgbo-ustc
Copy link
Contributor

建议不同函数的优化拆分到不同的PR里面

mutable size_t total_normalized_rows = 0;

template<typename JSONParser, typename JSONStringSerializer>
void insertResultToColumn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be complex, I guess there should be a simpler implement with less branches

Copy link
Contributor

@lgbo-ustc lgbo-ustc Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should explain which case it is for each branch

@github-actions
Copy link

github-actions bot commented Mar 6, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

github-actions bot commented Mar 7, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Mar 7, 2025

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

github-actions bot commented Mar 8, 2025

Run Gluten Clickhouse CI on x86

Copy link
Member

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also introduce a dedicated rule for this optimization? Maybe, a custom And/Or expression class should be defined when necessary?

case _ => Option.empty[String]
}

private def canBeOptimized(expr: Expression): Boolean = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For no nested And/Or case, is it also viewed as an optimizable case?


def getExpressionName(expr: Expression): Option[String] = expr match {
case _: And => ExpressionMappings.expressionsMap.get(classOf[And])
case _: Or => ExpressionMappings.expressionsMap.get(classOf[Or])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check seems redundant to the configuration based check.

case _ => exprCall.children.exists(c => canBeOptimized(c))
}
case Some(f) =>
GlutenConfig.get.getSupportedCollapsedExpressions.split(",").exists(c => c.equals(f))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mixed And/Or case, is it viewed as optimizable case?

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou
Copy link
Contributor Author

KevinyhZou commented Mar 20, 2025

Can we also introduce a dedicated rule for this optimization? Maybe, a custom And/Or expression class should be defined when necessary?

Now I have implement a dedicated rule for this, introduce the CHAnd, CHOr customized expression for optimization.
Mixed nested and/or would be optimized, I have add this case in the ut, non nested and/or would not. please help to review this . @PHILO-HE

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou
Copy link
Contributor Author

cc @PHILO-HE

@PHILO-HE PHILO-HE changed the title [GLUTEN-8557][CH] Collapse nested function calls for And/Or for performance optimization [GLUTEN-8557][CH] Flatten nested And/Or for performance optimization Apr 1, 2025
Copy link
Member

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically looks good!

For such nested function optimization, it's a bit different from get_json_object optimization. So I would like to recommend to use "flatten" instead of "collapse" in the code.

Collapse => Flatten
Collapsed => Flattened

}
}

case class CHAnd(dataType: DataType, children: Seq[Expression], name: String, nullable: Boolean)
Copy link
Member

@PHILO-HE PHILO-HE Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to use the below name to reflect its usage:
FlattenedAnd

case class CHAnd(dataType: DataType, children: Seq[Expression], name: String, nullable: Boolean)
extends CHCollapsedExpression(children, name) {}

case class CHOr(dataType: DataType, children: Seq[Expression], name: String, nullable: Boolean)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FlattenedOr

@github-actions
Copy link

github-actions bot commented Apr 1, 2025

Run Gluten Clickhouse CI on x86

@KevinyhZou
Copy link
Contributor Author

Basically looks good!

For such nested function optimization, it's a bit different from get_json_object optimization. So I would like to recommend to use "flatten" instead of "collapse" in the code.

Collapse => Flatten Collapsed => Flattened

done

Copy link
Member

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@github-actions
Copy link

github-actions bot commented Apr 2, 2025

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Apr 2, 2025

Run Gluten Clickhouse CI on x86

Copy link
Contributor

@taiyang-li taiyang-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@taiyang-li taiyang-li merged commit 0918db2 into apache:main Apr 3, 2025
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CH]Optimize nested function calls

4 participants