Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47639] Support codegen for json_tuple. #45765

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

leixm
Copy link

@leixm leixm commented Mar 29, 2024

What changes were proposed in this pull request?

Support codegen for json_tuple.

Why are the changes needed?

Sometimes using json_tuple may cause performance regression because it does not support whole stage codegen..

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 29, 2024
@leixm
Copy link
Author

leixm commented Mar 29, 2024

@LuciferYang @cloud-fan Can you help review plz?

@LuciferYang
Copy link
Contributor

What is the newly added TestCodeGen.java used for?

@leixm
Copy link
Author

leixm commented Mar 29, 2024

What is the newly added TestCodeGen.java used for?

Sorry, i have deleted it.

@LuciferYang
Copy link
Contributor

LuciferYang commented Mar 29, 2024

I haven't looked at the code in detail yet, but I have two questions first:

  1. After this PR, which test cases still cover the non-codegen code branches? The test cases related to json_tuplein org.apache.spark.sql.JsonFunctionsSuite seem to have all been changed to cover the codegen branch.

  2. Can you add a test branch for json_tuple in JsonBenchmark to compare codegen on and codegen off and update the benchmark results? Just like what get_json_object did. (The benchmark result can be obtained by running benchmark.yml with GA.)

https://github.com/apache/spark/blob/a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala#L270C24-L280

image

@LuciferYang
Copy link
Contributor

@leixm
There are some test failed like

[info] - json_tuple escaping *** FAILED *** (10 milliseconds)
[info]   java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 48, Column 30: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 48, Column 30: Assignment conversion not possible from type "scala.collection.IterableOnce" to type "org.apache.spark.sql.catalyst.util.ArrayData"

@leixm
Copy link
Author

leixm commented Apr 2, 2024

I haven't looked at the code in detail yet, but I have two questions first:

  1. After this PR, which test cases still cover the non-codegen code branches? The test cases related to json_tuplein org.apache.spark.sql.JsonFunctionsSuite seem to have all been changed to cover the codegen branch.
  2. Can you add a test branch for json_tuple in JsonBenchmark to compare codegen on and codegen off and update the benchmark results? Just like what get_json_object did. (The benchmark result can be obtained by running benchmark.yml with GA.)

https://github.com/apache/spark/blob/a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala#L270C24-L280

image

Sure, i have added codegen disable case.

@leixm
Copy link
Author

leixm commented Apr 2, 2024

I deleted below code, this ut will cause an error(Assignment conversion not possible from type "scala.collection.IterableOnce" to type "org.apache.spark.sql.catalyst.util.ArrayData"), because GenerateExec will generate code through codeGenIterableOnce in normal scene, and the type of ev.value is IterableOnce.

test("json_tuple escaping") {
    GenerateUnsafeProjection.generate(
      JsonTuple(Literal("\"quote") ::  Literal("\"quote") :: Nil) :: Nil)
  }

@leixm
Copy link
Author

leixm commented Apr 2, 2024

Benchmark result:

[info] JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Text read                                            66             71           5         15.1          66.1       1.0X
[info] from_json                                          1205           1226          22          0.8        1205.4       0.1X
[info] json_tuple wholestage off                          1562           1604          36          0.6        1562.1       0.0X
[info] json_tuple wholestage on                           1334           1348          12          0.7        1333.9       0.0X
[info] get_json_object wholestage off                     1198           1230          35          0.8        1198.5       0.1X
[info] get_json_object wholestage on                      1217           1238          25          0.8        1216.5       0.1X

@leixm
Copy link
Author

leixm commented Apr 2, 2024

Seems flaky test.

@@ -3,128 +3,129 @@ Benchmark for performance of JSON parsing
================================================================================================

Preparing data for benchmarking ...
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Linux 6.5.0-1016-azure
AMD EPYC 7763 64-Core Processor
OpenJDK 64-Bit Server VM 17.0.9+0 on Mac OS X 12.6.7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use GitHub Action's machine to generate this file, and also update the result file for Java 21

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -272,11 +272,6 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
assert(jt.eval(null).iterator.to(Seq).head === expected)
}

test("json_tuple escaping") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leixm
Copy link
Author

leixm commented Apr 8, 2024

@LuciferYang @MaxGekk PTAL.

override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): JsonTuple =
copy(children = newChildren)

override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked into it yet, but is it possible to make codegen simpler and write most of the code in Scala?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we have to consider calculating the foldable expr in advance, which is the reason why doGenCode is bloated. I have tried to simplify the codegen code as much as possible. Do you have any good suggestions?

@LuciferYang
Copy link
Contributor

@leixm Sorry, I've been busy with internal matters at the company recently, so it might take me a while to focus on this PR.

@@ -501,55 +503,156 @@ case class JsonTuple(children: Seq[Expression])
return nullRow
}

val fieldNames = if (constantFields == fieldExpressions.length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one idea to simplify the implementation: I think "all constant field names" is the most common case, so we should optimize for it. Mixed case is rather rare and we should just treat it as "no constant field name" to simplify things.

codeList
}

val splitParseCode = ctx.splitExpressionsWithCurrentInputs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to split the method if we don't optimize the mixed case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants