[SPARK-47639] Support codegen for json_tuple. #45765

leixm · 2024-03-29T03:30:20Z

What changes were proposed in this pull request?

Support codegen for json_tuple.

Why are the changes needed?

Using json_tuple can cause a performance regression because it does not support whole-stage codegen.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

leixm · 2024-03-29T03:32:47Z

@LuciferYang @cloud-fan Can you help review plz?

LuciferYang · 2024-03-29T03:35:22Z

What is the newly added TestCodeGen.java used for?

leixm · 2024-03-29T03:37:57Z

> What is the newly added TestCodeGen.java used for?

Sorry, I have deleted it.

LuciferYang · 2024-03-29T05:48:55Z

I haven't looked at the code in detail yet, but I have two questions first:

1. After this PR, which test cases still cover the non-codegen code branches? The test cases related to json_tuple in org.apache.spark.sql.JsonFunctionsSuite seem to have all been changed to cover the codegen branch.
2. Can you add a case for json_tuple in JsonBenchmark to compare codegen on and codegen off, and update the benchmark results, just like what was done for get_json_object? (The benchmark results can be obtained by running benchmark.yml with GitHub Actions.)

https://github.com/apache/spark/blob/a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala#L270C24-L280

LuciferYang · 2024-04-01T07:47:57Z

@leixm
Some tests failed, for example:

[info] - json_tuple escaping *** FAILED *** (10 milliseconds)
[info]   java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 48, Column 30: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 48, Column 30: Assignment conversion not possible from type "scala.collection.IterableOnce" to type "org.apache.spark.sql.catalyst.util.ArrayData"

leixm · 2024-04-02T03:42:00Z

> I haven't looked at the code in detail yet, but I have two questions first:
>
> 1. After this PR, which test cases still cover the non-codegen code branches? The test cases related to json_tuple in org.apache.spark.sql.JsonFunctionsSuite seem to have all been changed to cover the codegen branch.
> 2. Can you add a case for json_tuple in JsonBenchmark to compare codegen on and codegen off, and update the benchmark results, just like what was done for get_json_object? (The benchmark results can be obtained by running benchmark.yml with GitHub Actions.)
>
> https://github.com/apache/spark/blob/a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala#L270C24-L280

Sure, I have added a case with codegen disabled.

leixm · 2024-04-02T03:47:06Z

I deleted the code below. This UT causes an error (Assignment conversion not possible from type "scala.collection.IterableOnce" to type "org.apache.spark.sql.catalyst.util.ArrayData") because, in the normal case, GenerateExec generates code through codeGenIterableOnce, so the type of ev.value is IterableOnce.

test("json_tuple escaping") {
  GenerateUnsafeProjection.generate(
    JsonTuple(Literal("\"quote") :: Literal("\"quote") :: Nil) :: Nil)
}

leixm · 2024-04-02T03:47:56Z

Benchmark result:

[info] JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Text read                                            66             71           5         15.1          66.1       1.0X
[info] from_json                                          1205           1226          22          0.8        1205.4       0.1X
[info] json_tuple wholestage off                          1562           1604          36          0.6        1562.1       0.0X
[info] json_tuple wholestage on                           1334           1348          12          0.7        1333.9       0.0X
[info] get_json_object wholestage off                     1198           1230          35          0.8        1198.5       0.1X
[info] get_json_object wholestage on                      1217           1238          25          0.8        1216.5       0.1X
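
As a reading aid for the table above: the Relative column is measured against the "Text read" baseline and rounds to 0.0X for both json_tuple rows, so the effect of codegen is easier to see from the best times directly. A minimal Scala sketch of that arithmetic, using only the numbers in the table (the object and value names are illustrative, not part of the PR):

```scala
object JsonTupleSpeedup {
  // Best times in milliseconds, copied from the benchmark table above.
  val wholestageOffMs: Double = 1562.0
  val wholestageOnMs: Double = 1334.0

  // Ratio of the two best times: how many times faster the codegen path ran.
  val speedup: Double = wholestageOffMs / wholestageOnMs

  // Share of the wholestage-off time that was saved, as a percentage.
  val timeSavedPct: Double = (1.0 - wholestageOnMs / wholestageOffMs) * 100.0

  def main(args: Array[String]): Unit =
    println(f"speedup = $speedup%.2fx, time saved = $timeSavedPct%.1f%%")
}
```

On this single run that works out to roughly a 1.17x speedup, or about 14.6% less time, for json_tuple. The get_json_object rows (1198 ms off, 1217 ms on) show essentially no difference.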

leixm · 2024-04-02T09:35:20Z

Seems like a flaky test.

LuciferYang · 2024-04-02T09:41:53Z

sql/core/benchmarks/JsonBenchmark-results.txt

@@ -3,128 +3,129 @@ Benchmark for performance of JSON parsing
 ================================================================================================

 Preparing data for benchmarking ...
-OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
-AMD EPYC 7763 64-Core Processor
+OpenJDK 64-Bit Server VM 17.0.9+0 on Mac OS X 12.6.7


Please use a GitHub Actions machine to generate this file, and also update the result file for Java 21.

LuciferYang · 2024-04-02T09:42:57Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala

@@ -272,11 +272,6 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
    assert(jt.eval(null).iterator.to(Seq).head === expected)
  }

-  test("json_tuple escaping") {


cc @MaxGekk

leixm · 2024-04-08T07:11:17Z

@LuciferYang @MaxGekk PTAL.

cloud-fan · 2024-04-08T08:58:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

+  override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): JsonTuple =
+    copy(children = newChildren)
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {


Haven't looked into it yet, but is it possible to make codegen simpler and write most of the code in Scala?

leixm · 2024-04-08

That is because we have to consider evaluating the foldable expressions in advance, which is why doGenCode is bloated. I have tried to simplify the codegen code as much as possible. Do you have any good suggestions?

LuciferYang · 2024-04-11T10:47:21Z

@leixm Sorry, I've been busy with internal matters at the company recently, so it might take me a while to focus on this PR.

cloud-fan · 2024-04-11T13:47:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
      return nullRow
    }

+    val fieldNames = if (constantFields == fieldExpressions.length) {


One idea to simplify the implementation: I think "all constant field names" is the most common case, so we should optimize for it. The mixed case is rather rare, and we should just treat it as "no constant field name" to simplify things.
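
A minimal Scala sketch of this suggestion, not the PR's code. `FieldNameStrategy`, `Precomputed`, `PerRow`, and `choose` are hypothetical names, and `Option[String]` is an assumed stand-in for "this field-name expression is foldable and evaluated to this constant":

```scala
object FieldNameStrategy {
  sealed trait FieldNames

  // Every field name is a constant, so the names can be computed once up front.
  final case class Precomputed(names: Seq[String]) extends FieldNames

  // At least one field name depends on the input row, so all names
  // are evaluated per row, including any that happen to be constant.
  case object PerRow extends FieldNames

  // folded(i) is Some(name) when the i-th field expression is a constant,
  // and None when it has to be evaluated against each row.
  def choose(folded: Seq[Option[String]]): FieldNames =
    if (folded.forall(_.isDefined)) Precomputed(folded.flatten)
    else PerRow
}
```

With only two outcomes, the mixed case needs no dedicated code path: `choose(Seq(Some("a"), None))` falls back to `PerRow` exactly as an all-dynamic input would.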

cloud-fan · 2024-04-11T13:49:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

+      codeList
+    }
+
+    val splitParseCode = ctx.splitExpressionsWithCurrentInputs(


We don't need to split the method if we don't optimize the mixed case?

[SPARK-47639] Support codegen for json_tuple.

88e5a4b

github-actions bot added the SQL label Mar 29, 2024

fix

87cd1e5

leixm added 2 commits March 29, 2024 11:43

fix.

8535f8d

fix.

25b1d20

fix.

07196b2

fix.

63351a5

LuciferYang reviewed Apr 2, 2024

fix.

03c7a15

cloud-fan reviewed Apr 8, 2024

leixm requested review from LuciferYang and cloud-fan April 11, 2024 09:58

cloud-fan reviewed Apr 11, 2024
