[SPARK-47639] Support codegen for json_tuple. #45765
@LuciferYang @cloud-fan Could you help review this, please?
What is the newly added |
Sorry, I have deleted it.
I haven't looked at the code in detail yet, but I have two questions first:
|
@leixm
|
Sure, I have added a test case with codegen disabled.
I deleted the code below. This UT causes an error (Assignment conversion not possible from type "scala.collection.IterableOnce" to type "org.apache.spark.sql.catalyst.util.ArrayData"), because
|
Benchmark result:
|
This seems to be a flaky test.
@@ -3,128 +3,129 @@ Benchmark for performance of JSON parsing
================================================================================================

Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
AMD EPYC 7763 64-Core Processor
OpenJDK 64-Bit Server VM 17.0.9+0 on Mac OS X 12.6.7
Please use the GitHub Actions machine to generate this file, and also update the result file for Java 21.
Done.
@@ -272,11 +272,6 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
    assert(jt.eval(null).iterator.to(Seq).head === expected)
  }

  test("json_tuple escaping") {
cc @MaxGekk
@LuciferYang @MaxGekk PTAL.
  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): JsonTuple =
    copy(children = newChildren)

  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
Haven't looked into it yet, but is it possible to make codegen simpler and write most of the code in Scala?
doGenCode is bloated because we have to evaluate the foldable expressions in advance. I have tried to simplify the codegen code as much as possible. Do you have any good suggestions?
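For illustration, one way to read the "write most of the code in Scala" suggestion is to keep the per-row logic in an ordinary Scala helper and have the generated Java contain only a call into it. The sketch below is self-contained and does not use Spark's classes: `FieldExtractorSketch`, `ThinCodegenSketch`, and `genCall` are hypothetical names, and the `Map[String, String]` input is an assumed stand-in for an already parsed JSON object. It is not the code in this PR.

```scala
// Hypothetical helper: all per-row logic lives in plain Scala.
final class FieldExtractorSketch(fieldNames: Seq[String]) extends Serializable {
  // `parsed` stands in for an already parsed JSON object (an assumption of this sketch).
  // A field that is missing from the object yields null.
  def evaluate(parsed: Map[String, String]): Seq[String] =
    fieldNames.map(name => parsed.get(name).orNull)
}

object ThinCodegenSketch {
  // Builds the Java statement a doGenCode-style method could emit: a single call
  // into the helper instead of extraction logic inlined as generated Java.
  // The three arguments are the variable names used in the generated source.
  def genCall(helperRef: String, inputVar: String, resultVar: String): String =
    s"$resultVar = $helperRef.evaluate($inputVar);"
}
```

With this shape, field names that are known up front can be passed to the helper's constructor once, so the generated source stays a one-line call regardless of how the helper resolves them.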
@leixm Sorry, I've been busy with internal matters at the company recently, so it might take me a while to focus on this PR.
@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
      return nullRow
    }

    val fieldNames = if (constantFields == fieldExpressions.length) {
One idea to simplify the implementation: I think "all constant field names" is the most common case, so we should optimize for it. The mixed case is rather rare, and we should just treat it as "no constant field names" to simplify things.
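A minimal sketch of that two-way split, written against stand-in types instead of Catalyst expressions: `FieldExpr`, `Constant`, `Dynamic`, and `FieldNameResolution` are hypothetical names introduced for illustration, and the `row` map is an assumed substitute for evaluating an expression against an input row.

```scala
// Hypothetical stand-ins for Catalyst expressions; not Spark's actual classes.
sealed trait FieldExpr
final case class Constant(name: String) extends FieldExpr  // foldable: known before any row is read
final case class Dynamic(column: String) extends FieldExpr // must be evaluated for every row

object FieldNameResolution {
  // Some(names) only when every field name is constant. A mixed list returns None,
  // i.e. it is treated exactly like the "no constant field names" case.
  def precomputedNames(fields: Seq[FieldExpr]): Option[Seq[String]] = {
    val constants = fields.collect { case Constant(n) => n }
    if (constants.length == fields.length) Some(constants) else None
  }

  // Fast path: reuse the precomputed names. Fallback: evaluate every field per row.
  def resolve(fields: Seq[FieldExpr], row: Map[String, String]): Seq[String] =
    precomputedNames(fields).getOrElse(fields.map {
      case Constant(n) => n
      case Dynamic(c)  => row.get(c).orNull // a missing column gives a null field name
    })
}
```

In a real implementation the precomputed names would be built once, not on every call; the sketch recomputes them only to stay short. The point is that there are two code paths instead of three, because the mixed case no longer needs its own handling.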
    codeList
  }

  val splitParseCode = ctx.splitExpressionsWithCurrentInputs(
We don't need to split the method if we don't optimize the mixed case?
What changes were proposed in this pull request?
Support codegen for json_tuple.
Why are the changes needed?
Using json_tuple can sometimes cause a performance regression because it does not support whole-stage codegen.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs.
Was this patch authored or co-authored using generative AI tooling?
No.