[GLUTEN-8557][CH] Flatten nested And/Or for performance optimization #8558
taiyang-li merged 1 commit into apache:main
Run Gluten Clickhouse CI on x86
Title changed: And, GetStructField -> And/GetStructField
Title changed: And/GetStructField -> And/Or/GetStructField/GetJsonObject
Force-pushed from b86d9e7 to 2cc64f2.
Performance test
    .internal()
    .doc("Collapse nested functions as one for optimization.")
    .stringConf
    .createWithDefault("get_struct_field,get_json_object");
    throw new GlutenNotSupportException("UDF name is not found!")
  }
- val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
+ var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
How is collapsedFunctionsMap related to UDFs? Logically there seems to be no need to couple the two; it would be better to decouple them.
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.{DataType, DataTypes}

case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] {

}

private def canBeOptimized(plan: SparkPlan): Boolean = plan match {
  case p: ProjectExecTransformer =>
Can expressions in the Generate operator be optimized?
It seems they cannot.
static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; }

bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const
@lgbo-ustc please review the changes related to get_json_object.
  {
      const auto & args = substrait_func.arguments();
-     if (args.size() != 2)
+     if (args.size() < 2)
get_json_object(get_json_object(d, '$.a'), '$.b') is optimized to get_json_object(d, '$.a', '$.b'), so the call may have more than 2 arguments.
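To make the rewrite in the comment above concrete, here is a minimal Scala sketch of collapsing nested get_json_object calls into a single call that carries several path arguments. The expression types (Expr, Column, Literal, GetJsonObject) and the collapse function are hypothetical stand-ins invented for illustration; they are not Spark's or Gluten's classes, and this is not the PR's implementation.

```scala
object JsonPathCollapseSketch {
  // Toy expression tree; hypothetical types, not Spark's or Gluten's classes.
  sealed trait Expr
  final case class Column(name: String) extends Expr
  final case class Literal(value: String) extends Expr
  // A get_json_object call: the JSON input followed by one or more path arguments.
  final case class GetJsonObject(input: Expr, paths: List[Expr]) extends Expr

  // Merge get_json_object(get_json_object(x, p1), p2) into get_json_object(x, p1, p2).
  // The input is collapsed first, so deeper nesting is merged bottom-up.
  def collapse(expr: Expr): Expr = expr match {
    case GetJsonObject(input, paths) =>
      collapse(input) match {
        case GetJsonObject(innerInput, innerPaths) =>
          // Inner paths are applied first, so they come first in the merged list.
          GetJsonObject(innerInput, innerPaths ++ paths)
        case other =>
          GetJsonObject(other, paths)
      }
    case other => other
  }
}
```

With this shape, a collapsed call has the input plus N paths, which is consistent with relaxing the argument check from exactly 2 to at least 2.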
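Please explain the optimization idea for get_json_object; I don't quite understand why it is changed here to take multiple path arguments.
I suggest splitting the optimizations for different functions into separate PRs.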
mutable size_t total_normalized_rows = 0;

template<typename JSONParser, typename JSONStringSerializer>
void insertResultToColumn(
This seems complex; I suspect there is a simpler implementation with fewer branches.
Each branch should have a comment explaining which case it handles.
Force-pushed from f4738d0 to 2c6d6f6.
PHILO-HE left a comment:
Can we also introduce a dedicated rule for this optimization? Perhaps a custom And/Or expression class could be defined if necessary.
    case _ => Option.empty[String]
  }

  private def canBeOptimized(expr: Expression): Boolean = {
Is an And/Or with no nesting also treated as an optimizable case?
  def getExpressionName(expr: Expression): Option[String] = expr match {
    case _: And => ExpressionMappings.expressionsMap.get(classOf[And])
    case _: Or => ExpressionMappings.expressionsMap.get(classOf[Or])
This check seems redundant with the configuration-based check.
      case _ => exprCall.children.exists(c => canBeOptimized(c))
    }
  case Some(f) =>
    GlutenConfig.get.getSupportedCollapsedExpressions.split(",").exists(c => c.equals(f))
Is a mixed And/Or expression treated as an optimizable case?
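The thread does not record an answer to the questions about non-nested and mixed And/Or, so the following Scala sketch only illustrates one plausible policy: chains of the same operator are merged into a single n-ary node, a binary node with no same-operator nesting is left binary, and in a mixed tree each operator's chains are flattened independently. All types here (Expr, Leaf, And, Or, FlatAnd, FlatOr) are hypothetical toy classes, not Spark's And/Or and not the classes added in this PR.

```scala
object FlattenAndOrSketch {
  // Toy boolean expression tree; hypothetical types for illustration only.
  sealed trait Expr
  final case class Leaf(name: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr
  // N-ary forms produced by flattening.
  final case class FlatAnd(children: List[Expr]) extends Expr
  final case class FlatOr(children: List[Expr]) extends Expr

  // Collect the operands of a maximal chain of And nodes, in left-to-right order.
  // Each operand that is not an And is flattened on its own.
  private def andOperands(e: Expr): List[Expr] = e match {
    case And(l, r) => andOperands(l) ++ andOperands(r)
    case other => List(flatten(other))
  }

  // Same as andOperands, for chains of Or nodes.
  private def orOperands(e: Expr): List[Expr] = e match {
    case Or(l, r) => orOperands(l) ++ orOperands(r)
    case other => List(flatten(other))
  }

  def flatten(e: Expr): Expr = e match {
    case a: And =>
      andOperands(a) match {
        case List(l, r) => And(l, r) // no nested And: keep the binary node
        case ops => FlatAnd(ops)     // three or more operands: one n-ary And
      }
    case o: Or =>
      orOperands(o) match {
        case List(l, r) => Or(l, r)
        case ops => FlatOr(ops)
      }
    case other => other
  }
}
```

Under this policy, And(And(a, b), c) becomes FlatAnd(a, b, c), while And(Or(a, Or(b, c)), d) keeps its binary And and only the inner Or chain becomes FlatOr(a, b, c). Operand order is preserved, which matters if the evaluator short-circuits.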
Force-pushed from fb730ea to 578808b.
Force-pushed from 578808b to 2dbad5d.
I have now implemented a dedicated rule for this, introducing the ...
Force-pushed from 8ac225b to 68ad6fd.
cc @PHILO-HE
  }
}

case class CHAnd(dataType: DataType, children: Seq[Expression], name: String, nullable: Boolean)
I suggest using the name below to reflect its usage:
FlattenedAnd
case class CHAnd(dataType: DataType, children: Seq[Expression], name: String, nullable: Boolean)
  extends CHCollapsedExpression(children, name) {}

case class CHOr(dataType: DataType, children: Seq[Expression], name: String, nullable: Boolean)
Force-pushed from 68ad6fd to 0685f0c.
Done.
Force-pushed from 7cc2137 to 6d31ab9.
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
(Fixes: #8557)
How was this patch tested?
Tested by unit tests.