[SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results #38784

sadikovi · 2022-11-24T05:14:36Z

What changes were proposed in this pull request?

This PR adds a SQL config, spark.sql.json.enablePartialResults, to control the change made in SPARK-40646. It allows us to fall back to the behaviour before that change.

It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. However, I could not reproduce the regression with the Apache Spark JSON benchmarks (we may need to extend them; I can do that as a follow-up). Regardless, I propose adding a SQL config so that the change can be disabled in case of performance degradation during JSON parsing.

Benchmark results are attached to the JIRA ticket.

Why are the changes needed?

Does this PR introduce any user-facing change?

Yes. The SQL config spark.sql.json.enablePartialResults is added to control the JSON partial results parsing introduced in SPARK-40646. Users can disable the feature if they see performance regressions when reading JSON files.
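As an illustration of how a user might toggle the flag, here is a minimal sketch. The object name, the sample record, and the schema are made up for this example; only the config key comes from this PR. It assumes a Spark build that includes this change (the config is versioned 3.4.0) and makes no claim about the exact rows printed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructType}

object PartialResultsToggle {
  // Hypothetical schema: a struct whose two fields are both expected to be integers.
  val schema: StructType = new StructType()
    .add("a", new StructType().add("x", IntegerType).add("y", IntegerType))

  // Sets the flag, then reads one made-up record in which "y" does not match the schema.
  def readWithFlag(spark: SparkSession, enabled: Boolean): DataFrame = {
    import spark.implicits._
    spark.conf.set("spark.sql.json.enablePartialResults", enabled.toString)
    val input = Seq("""{"a": {"x": 1, "y": "not-a-number"}}""").toDS()
    spark.read.schema(schema).json(input)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("partial-results-toggle") // hypothetical app name
      .getOrCreate()
    // Print the parsed row with the flag on and off to compare the two behaviours.
    Seq(true, false).foreach { enabled =>
      println(s"spark.sql.json.enablePartialResults=$enabled")
      readWithFlag(spark, enabled).show(false)
    }
    spark.stop()
  }
}
```

The flag is set per session through `spark.conf`, so it can be switched off for a single job without touching cluster-wide defaults.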

How was this patch tested?

I extended the existing unit tests to run with the flag both enabled and disabled.
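The pattern of running the same check under both settings can be sketched roughly as follows. `withConf` is a hypothetical helper written for this example, not necessarily what the tests in this PR use; it relies only on the public `spark.conf` API (`getOption`, `set`, `unset`).

```scala
import org.apache.spark.sql.SparkSession

object ConfToggle {
  // Hypothetical helper: runs `body` with `key` set to `value`,
  // then restores whatever value was in effect before the call.
  def withConf[T](spark: SparkSession, key: String, value: String)(body: => T): T = {
    val previous = spark.conf.getOption(key)
    spark.conf.set(key, value)
    try {
      body
    } finally {
      previous match {
        case Some(v) => spark.conf.set(key, v)
        case None => spark.conf.unset(key)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("conf-toggle") // hypothetical app name
      .getOrCreate()
    val key = "spark.sql.json.enablePartialResults"
    // Run the same assertion once per flag value.
    Seq("true", "false").foreach { flag =>
      withConf(spark, key, flag) {
        assert(spark.conf.get(key) == flag)
      }
    }
    spark.stop()
  }
}
```

Restoring the previous value in `finally` keeps one test's setting from leaking into the next, even when the body throws.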

sadikovi · 2022-11-24T05:16:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "when one or more fields do not match the schema")
+      .version("3.4.0")
+      .booleanConf
+      .createWithDefault(false)


I am still debating whether the default should be true or false. On one hand, when enabled, it fixes the correctness issue of partially parsing JSON records. On the other hand, when enabled, it could cause performance issues (impact is yet to be confirmed).

@MaxGekk I would love to know your thoughts on this. I would prefer to keep it enabled until a benchmark is produced that shows the regression.

MaxGekk · 2022-11-24

Could you regenerate the JsonBenchmark results? If there are no significant diffs, I am OK with enabling it. Also, it would be nice to add one more benchmark for this particular case of partial results.

sadikovi · 2022-11-24

Yes, I will do that, thanks!

MaxGekk · 2022-12-13

@sadikovi Could you enable it by default, since there is no regression?

sadikovi · 2022-12-13

Yes, I will enable the flag by default.

sadikovi · 2022-12-13

Done. I have enabled the flag by default.

AmplabJenkins · 2022-11-27T15:20:25Z

Can one of the admins verify this patch?

dongjoon-hyun · 2022-12-09

Gentle ping, @sadikovi.

sadikovi · 2022-12-11T21:23:58Z

Sorry, I was working on another issue. I will address the comments today/tomorrow.

sadikovi · 2022-12-12T22:29:07Z

Benchmark results with the config spark.sql.json.enablePartialResults enabled and disabled (txt files):
JsonBenchmark results with config as false
JsonBenchmark results with config as true

The results seem fairly close, but pushdown with filters appears to be faster when the config is enabled, which is surprising. I reran that benchmark several times and can confirm that the reported numbers are fairly accurate.
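For anyone who wants a quick check on their own data, a rough timing sketch could look like the following. This is not the JsonBenchmark from the Spark repository: the object name, the record shape, and the row count are assumptions for illustration, and wall-clock timing of a single run is noisy, so no numbers are claimed here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructType}

object PartialResultsTiming {
  // Hypothetical schema: a struct whose two fields are both expected to be integers.
  private val schema = new StructType()
    .add("a", new StructType().add("x", IntegerType).add("y", IntegerType))

  // Parses `rows` made-up records with the flag set to `enabled`; returns elapsed milliseconds.
  def timeMs(spark: SparkSession, enabled: Boolean, rows: Int): Long = {
    import spark.implicits._
    spark.conf.set("spark.sql.json.enablePartialResults", enabled.toString)
    // Every record carries one field ("y") that does not match the schema.
    val input = (0 until rows).map(i => s"""{"a": {"x": $i, "y": "bad"}}""").toDS()
    val start = System.nanoTime()
    // The noop sink forces a full parse of every row without writing any output.
    spark.read.schema(schema).json(input).write.format("noop").mode("overwrite").save()
    (System.nanoTime() - start) / 1000000L
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("partial-results-timing") // hypothetical app name
      .getOrCreate()
    timeMs(spark, enabled = true, rows = 1000) // warm-up run, result discarded
    Seq(true, false).foreach { enabled =>
      println(s"enablePartialResults=$enabled: ${timeMs(spark, enabled, 100000)} ms")
    }
    spark.stop()
  }
}
```

Writing to the `noop` format rather than calling `count()` is deliberate: a plain count may let Spark skip parsing the fields under test.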

MaxGekk · 2022-12-14T05:27:50Z

+1, LGTM. Merging to master.
Thank you, @sadikovi.

dongjoon-hyun · 2022-12-14T05:38:17Z

Thank you, @sadikovi and @MaxGekk!

sadikovi · 2022-12-14T06:01:23Z

Awesome, thank you!

…e/disable JSON partial results

### What changes were proposed in this pull request?

This PR adds a SQL config `spark.sql.json.enablePartialResults` to control SPARK-40646 change. This allows us to fall back to the behaviour before the change.

It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with Apache Spark JSON benchmarks (maybe we need to extend them, I can do it as a follow-up). Regardless, I propose to add a SQL config to have an ability to disable the change in case of performance degradation during JSON parsing.

Benchmark results are attached to the JIRA ticket.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

SQL config `spark.sql.json.enablePartialResults` is added to control the behaviour of SPARK-40646 JSON partial results parsing. Users can disable the feature if they find any performance regressions when reading JSON files.

### How was this patch tested?

I extended existing unit tests to test with flag enabled and disabled.

Closes apache#38784 from sadikovi/add-flag-json-parsing.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

github-actions bot added the SQL label Nov 24, 2022

sadikovi changed the title ~~[SPARK-41248] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646~~ [SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646 Nov 24, 2022

dongjoon-hyun reviewed Dec 9, 2022

add config flag

cf2d583

sadikovi force-pushed the add-flag-json-parsing branch from 2e9dcd0 to cf2d583 Compare December 12, 2022 22:32

update benchmark results

46a952b

sadikovi requested review from MaxGekk and dongjoon-hyun and removed request for MaxGekk and dongjoon-hyun December 12, 2022 22:33

enable by default

d85134c

MaxGekk approved these changes Dec 14, 2022

MaxGekk changed the title ~~[SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646~~ [SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results Dec 14, 2022

MaxGekk closed this in 5b50834 Dec 14, 2022
