[GLUTEN-9106][VL] Add support for staticInvoke CharVarcharCodegenUtils. #9107
Conversation
Run Gluten Clickhouse CI on x86
@zzcclp @jinchengchenghh Hi, dear Zhichao and Jincheng, could you please review this PR? It addresses issue #9106.
child.map(replaceWithExpressionTransformer0(_, attributeSeq, expressionsMap)),
      i)
  }
} else if (objectName.endsWith("CharVarcharCodegenUtils")) {
Please extract the logic for case i: StaticInvoke => into a new function; otherwise there will be too much code here.
@jinchengchenghh Hi Jincheng, I have now extracted the StaticInvoke branch into a new replaceStaticInvokeWithExpressionTransformer function.
Please take a look when you have time, and let me know if there are any further improvements needed. I'm happy to make additional changes if necessary.
expressionsMap: Map[Class[_], String]): ExpressionTransformer = {
  val objectName = i.staticObject.getName.stripSuffix("$")
  // ...
  if (objectName.endsWith("UrlCodec")) {
Please use this format to simplify the code and make it easy to implement the others:
val staticInvokeMap = Map(
  "UrlCodec" -> Map(
    "decode" -> ExpressionNames.URL_DECODE,
    "encode" -> ExpressionNames.URL_ENCODE))
val transformer = staticInvokeMap.filter(o => objectName.endsWith(o._1))
if (transformer.isEmpty) {
  throw // ...
}
Hi @jinchengchenghh. I notice there's a difference in how UrlCodec and CharVarcharCodegenUtils handle StaticInvoke's arguments.
It turns out that UrlCodec only uses the 1st argument, val child = i.arguments.head (the column identifier), and ignores the 2nd argument "UTF-8", while CharVarcharCodegenUtils takes all the parameters (i.e. the column identifier and the length limit) via val children = i.arguments.
So I'm a bit unsure how we could better handle this discrepancy here, since additional argument-handling logic would make it complex again.
I think in most cases we need all the arguments; UrlCodec is a special case, maybe because the backend only supports "UTF-8". We should throw an exception if the 2nd argument is not "UTF-8", but currently we do not.
So the other common cases should use the staticInvokeMap, which can also help developers figure out the special cases.
@jinchengchenghh I'm now using a staticInvokeMap in the latest fix:
- a staticInvokeMap defines the supported objects and their functions, and allows for future additions.
- special handling is kept for UrlCodec, which only uses the 1st argument.
Could you please review the refactor and let me know if further adjustment is needed. Thanks!
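To make the discussed design concrete, here is a minimal, self-contained Scala sketch. Note this is an illustration, not Gluten's actual code: StaticInvokeDispatch, keptArgCount, and the substrait names are assumptions based on the thread above. The map drives the common cases, and UrlCodec is special-cased to consume only its first argument.

```scala
// Illustrative sketch only: names are assumptions based on the discussion,
// not Gluten's real implementation.
object StaticInvokeDispatch {
  // Supported static objects -> (Spark function name -> substrait name).
  val staticInvokeMap: Map[String, Map[String, String]] = Map(
    "UrlCodec" -> Map(
      "decode" -> "url_decode",
      "encode" -> "url_encode"),
    "CharVarcharCodegenUtils" -> Map(
      "readSidePadding" -> "read_side_padding",
      "charTypeWriteSideCheck" -> "char_type_write_side_check",
      "varcharTypeWriteSideCheck" -> "varchar_type_write_side_check"))

  // Resolve the substrait name, or None if the call is unsupported.
  def resolve(objectName: String, functionName: String): Option[String] =
    staticInvokeMap.collectFirst {
      case (obj, fns) if objectName.endsWith(obj) => fns.get(functionName)
    }.flatten

  // UrlCodec only consumes the 1st argument; the 2nd ("UTF-8") is dropped.
  def keptArgCount(objectName: String): Int =
    if (objectName.endsWith("UrlCodec")) 1 else Int.MaxValue
}
```

An unsupported object/function pair resolves to None, which is the point where the not-supported exception would be raised.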
final val UDF_PLACEHOLDER = "udf_placeholder"
final val UDAF_PLACEHOLDER = "udaf_placeholder"
// Spark Catalyst util functions
Spark StaticInvoke Catalyst util functions
And I think the static invoke functions may not be supported in the CH backend, so only support them in the Velox backend.
We haven't explored the CH backend much so far. But based on my knowledge, if CH does not support them, a native validation would likely fail and thus fall back on such calls; further investigation of the code is needed.
Yes, you don't need to support it; just make sure it falls back in the CH backend.
Sure, I'll look into it right away. Could you maybe mask the password later for safety? I have already saved the password myself. Thanks @jinchengchenghh
The password is in the CH backend document, but it's hard to find; it is open to all.
Got it. Thx >:)
Hi @jinchengchenghh, I have several findings and would like to check with you; please correct me if I'm wrong.
Just add the new function names to CH_BLACKLIST_SCALAR_FUNCTION like URL_DECODE
Hi @jinchengchenghh, some new findings:
Can we for now intercept unsupported staticInvoke functions for CH at
Yes, I agree. Add the substrait name check via ValidatorApi::doExprValidate in replaceStaticInvokeWithExpressionTransformer rather than directly using ExpressionNames.URL_DECODE.
Hi @jinchengchenghh, I've added an extra
These functions are not merged in Velox, so we cannot add a unit test for them. Can we merge these functions' native implementations first?
Sorry for the late reply. The current change is OK for CH.
Thanks a lot @jinchengchenghh and @taiyang-li! So does that mean we can now move on to the Velox-side PR? link
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
Current PR awaits the PR merge at the Velox side and is not stale.
rui-mo left a comment:
Thanks! Looks good overall.
GenericExpressionTransformer(exprName, childTransformers, i)
}
if (Seq("encode", "decode").contains(i.functionName) && i.objectName.endsWith("UrlCodec")) {
Are the comparisons case-sensitive?
)
}
if (i.functionName == "isLuhnNumber") {
)
}
if (Seq("encode", "decode").contains(i.functionName) && i.objectName.endsWith("Base64")) {
}
if (
  Set("varcharTypeWriteSideCheck", "charTypeWriteSideCheck", "readSidePadding").contains(
// Add test suite for CharVarcharCodegenUtils functions.
// A ProjectExecTransformer is expected to be constructed after expr support.
// We currently test below functions with Spark v3.4
testWithMinSparkVersion("Test charTypeWriteSideCheck function", "3.4") {
nit: could be renamed to "charTypeWriteSideCheck" to be consistent in style with the other cases. And are these functions only supported after Spark 3.4?
Yes, all these functions should be tested with a minimum Spark version of 3.4:
- readSidePadding was introduced at branch-3.4 in sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/CharVarcharCodegenUtils.java.
- For charTypeWriteSideCheck & varcharTypeWriteSideCheck: prior to 3.4, Spark's analyzer introduces an extra redundant cast(... as string) expression. The current transformer for StaticInvoke does not handle this specific pattern with the extra cast, which causes a fallback to the vanilla Spark ProjectExec.
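To illustrate why the plan shape matters, here is a tiny self-contained sketch with stub case classes (these are stand-ins for illustration, not Spark's real expression tree): the transformer pattern only matches a bare StaticInvoke, so the extra cast that older analyzers insert defeats the match.

```scala
// Stub expression tree for illustration only (not Spark's real classes).
sealed trait Expr
case class StaticInvoke(functionName: String) extends Expr
case class Cast(child: Expr, dataType: String) extends Expr

// The transformer recognizes only a bare StaticInvoke; prior to Spark 3.4
// the analyzer wraps it as Cast(StaticInvoke(...), "string"), so the match
// fails and the project falls back to vanilla Spark.
def offloadable(e: Expr): Boolean = e match {
  case StaticInvoke(_) => true
  case _ => false
}
```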
Hi @rui-mo! I saw spark-tests fail, and the test failures were due to exception type/message mismatches, not incorrect functionality. E.g., a test expected "did not contain "Exceeds char/varchar type length limitation: 5"" but actually got "(6 vs. 5) Exceeds allowed length limitation: 5".
Since we should not touch Spark's tests and should keep the error message from the Velox side, I wonder whether the correct way to handle the current situation is to exclude the vanilla Spark tests in the and then assert
@Yifeng-Wang Yes, for the error msg mismatch we usually exclude Spark's test and rewrite it in the Gluten suite.
(force-pushed de06b42 to c04e793)
Hi @rui-mo, the test failures are now solved for each Spark version, and the test suites are updated to make correct message assertions. Would you please help take a look at the changes when you have time? Thanks!
// Add test suite for CharVarcharCodegenUtils functions.
// A ProjectExecTransformer is expected to be constructed after expr support.
// We currently test below functions with Spark v3.4
testWithMinSparkVersion("charTypeWriteSideCheck", "3.4") {
Why only test with Spark 3.4? Does Spark 3.3 not have this function?
Hi, thanks for pointing that out. From my observation, it turns out that although Spark 3.3 does have this function, the execution plan is different and does not trigger StaticInvoke, hence testing with min Spark 3.4.
GenericExpressionTransformer(exprName, childTransformers, i)
}
if (Seq("encode", "decode").contains(i.functionName) && i.objectName.endsWith("UrlCodec")) {
Could you use a case match for the function name? That would be cleaner.
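A hedged sketch of what the suggested case match could look like. exprNameFor is a hypothetical helper, and the returned substrait names are assumptions for illustration, not necessarily the names Gluten registers.

```scala
// Illustrative only: the substrait names below are assumptions.
def exprNameFor(objectName: String, functionName: String): String =
  (objectName, functionName) match {
    case (o, "encode") if o.endsWith("UrlCodec") => "url_encode"
    case (o, "decode") if o.endsWith("UrlCodec") => "url_decode"
    case (o, "encode") if o.endsWith("Base64") => "base64"
    case (o, "decode") if o.endsWith("Base64") => "unbase64"
    case (_, "isLuhnNumber") => "luhn_check"
    case _ =>
      throw new IllegalArgumentException(
        s"Unsupported staticInvoke: $objectName.$functionName")
  }
```

Unsupported pairs fall through to the default case and raise, which keeps the fallback behavior explicit.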
if (transformer.isEmpty) {
  throw new GlutenNotSupportException(s"Not supported staticInvoke call object: $objName")
if (
Please do not expand replaceIcebergStaticInvoke here; this function is a bit long already and would get longer.
.exclude("length check for input string values: with implicit cast")
.exclude("char/varchar type values length check: partitioned columns of other types")
enableSuite[GlutenDSV2CharVarcharTestSuite]
.exclude("length check for input string values: top-level columns")
Is there only the error message difference? Is there any change to align the error message?
Looks like even if we align the error message, we still need to copy the tests, so let's keep the current solution.
Ok. Keep current solution.
Please resolve the conflict, thanks!
Yes, I am doing it right now 🤝
(force-pushed c083639 to a0fbecb)
The latest commit added
.includeCH("length check for input string values: nested in both map key and value")
.includeCH("length check for input string values: nested in array of struct")
.includeCH("length check for input string values: nested in array of array")
.excludeGlutenTest("length check for input string values: top-level columns")
Would you please add comments above to explain the reason for excluding these tests (because they are rewritten due to ...)?
if (!BackendsApiManager.getValidatorApiInstance.doExprValidate(ExpressionNames.BASE64, i)) {
  throw new GlutenNotSupportException(
    s"Not supported to map current ${i.getClass} call on function: ${i.functionName}.")
}
Thanks for the refactor. These validations seem to be newly introduced; would you please clarify a bit?
Hi rui-mo, thanks for mentioning this. Apologies for such a long context within this PR.
This doExprValidate call on the validator instance was added deliberately to ensure the backend actually supports the function.
Technically this came from my earlier findings (if you scroll up a bit, on Mar 26), and currently it ensures that:
- An explicit call on doExprValidate is made. CH_BLACKLIST_SCALAR_FUNCTION is looked through during e.g. CHValidatorApi::doExprValidate, so that functions which are ONLY supported on the Velox side won't cause "Unknown function parser" for the ClickHouse BE (in CI tests).
- We intercept at Gluten with GlutenNotSupportException instead of sending the call to an actually unsupported (ClickHouse) backend.
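The interception described above can be sketched in a self-contained way. ValidatorApi, CHValidatorApi, and the exception class below are simplified stand-ins for Gluten's real classes, with different signatures than the actual code:

```scala
// Simplified stand-ins for illustration; real Gluten signatures differ.
final class GlutenNotSupportException(msg: String) extends RuntimeException(msg)

trait ValidatorApi {
  def doExprValidate(substraitName: String): Boolean
}

// A CH-style validator that consults a blacklist of scalar functions,
// mimicking the CH_BLACKLIST_SCALAR_FUNCTION lookup.
class CHValidatorApi(blacklist: Set[String]) extends ValidatorApi {
  override def doExprValidate(substraitName: String): Boolean =
    !blacklist.contains(substraitName)
}

// Intercept at Gluten: throw (triggering fallback) instead of sending an
// unsupported function to the native backend.
def requireSupported(validator: ValidatorApi, substraitName: String): Unit =
  if (!validator.doExprValidate(substraitName)) {
    throw new GlutenNotSupportException(
      s"Not supported staticInvoke function: $substraitName")
  }
```

With this shape, a blacklisted name raises at plan-conversion time, which is what makes the CH backend fall back cleanly.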
val exprName = fn match {
  case "varcharTypeWriteSideCheck" => ExpressionNames.VARCHAR_TYPE_WRITE_SIDE_CHECK
  case "charTypeWriteSideCheck" => ExpressionNames.CHAR_TYPE_WRITE_SIDE_CHECK
  case "readSidePadding" => ExpressionNames.READ_SIDE_PADDING
The mappings from Spark's names to Velox's names are more commonly put at https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitParser.cc#L387 to allow compatibility among different backends.
@rui-mo Hi rui-mo, could you give me some guidance on how I should update the code?
Currently the gluten-velox run of my expression conversion does not seem to rely on this map.
To add some context, at the Gluten side in ExpressionNames.scala I have e.g.:
final val VARCHAR_TYPE_WRITE_SIDE_CHECK = "varchar_type_write_side_check"
And at the Velox side, there is:
registerFunction<VarcharTypeWriteSideCheckFunction, Varchar, Varchar, int32_t>(
    {prefix + "varchar_type_write_side_check"});
So I guess the substrait function name (from ExpressionNames.VARCHAR_TYPE_WRITE_SIDE_CHECK = "varchar_type_write_side_check") already matches the Velox registered function name (prefix + "varchar_type_write_side_check").
Typically, in Gluten’s ExpressionNames.scala, the Spark function names are defined, while in the C++ map, the corresponding function names for a specific backend are provided. For instance, the Spark function add might be represented as plus in certain backends. To maintain consistency, Gluten’s Scala side keeps Spark’s original naming (e.g., add), and it is the responsibility of each backend to map it to the appropriate equivalent name. This is just a minor suggestion—please feel free to decide whether you’d like to make this change.
Hi rui-mo! Thanks for clarifying the reasoning behind this design. Maybe I'll keep the current state for this PR, but I'll be sure to follow this mapping mechanism in subsequent PRs. Really appreciate the guidance!
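For reference, the backend-side mapping rui-mo describes could be sketched like this in Scala, as a hypothetical stand-in for the C++ map in SubstraitParser.cc. The add -> plus pair comes from the review comment above; everything else here is illustrative.

```scala
// Gluten's Scala side keeps Spark's original names; each backend maps them
// to its own equivalents. This map is a hypothetical Scala rendering of the
// per-backend C++ mapping, not Gluten's actual code.
val sparkToBackendName: Map[String, String] = Map(
  "add" -> "plus" // example from the review discussion
)

// Fall back to the Spark name when the backend uses the same name.
def toBackendName(sparkName: String): String =
  sparkToBackendName.getOrElse(sparkName, sparkName)
```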
Hi @rui-mo, @jinchengchenghh! Thank you so much for your thorough review and patience on this long-running PR. Your feedback was invaluable.






What changes were proposed in this pull request?
To add support for static function calls in Spark's CharVarcharCodegenUtils.java, we add an expression-matching pattern in Gluten's ExpressionConverter.
(Fixes: #9106)
How was this patch tested?
Added test cases in Gluten's ScalarFunctionsValidateSuite.
Built Gluten Velox with Spark 3.4 (Spark 3.4.4 by default in pom.xml) and tested SQLs.
for test_varcharWriteSideCheck:
for charWriteSideCheck:
Checked the plan to validate that such operations offload to native and do not cause operator fallback:
