Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results #38784

Closed
wants to merge 3 commits into from

Conversation

sadikovi
Copy link
Contributor

@sadikovi sadikovi commented Nov 24, 2022

What changes were proposed in this pull request?

This PR adds a SQL config spark.sql.json.enablePartialResults to control SPARK-40646 change. This allows us to fall back to the behaviour before the change.

It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with Apache Spark JSON benchmarks (maybe we need to extend them, I can do it as a follow-up). Regardless, I propose to add a SQL config to have an ability to disable the change in case of performance degradation during JSON parsing.

Benchmark results are attached to the JIRA ticket.

Why are the changes needed?

Does this PR introduce any user-facing change?

SQL config spark.sql.json.enablePartialResults is added to control the behaviour of SPARK-40646 JSON partial results parsing. Users can disable the feature if they find any performance regressions when reading JSON files.

How was this patch tested?

I extended existing unit tests to test with flag enabled and disabled.

@github-actions github-actions bot added the SQL label Nov 24, 2022
"when one or more fields do not match the schema")
.version("3.4.0")
.booleanConf
.createWithDefault(false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still debating whether to keep this as true or false. On one hand, when enabled, it fixes the correctness issue of partially parsing JSON records. When disabled, it could cause performance issues (impact is yet to be confirmed).

@MaxGekk I would love to know your thoughts on this. I would prefer to keep it enabled until the benchmark is produced that shows the regression.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you re-gen results of JsonBenchmark. If there are no significant diffs, I am ok to enable it. Also, it would be nice to add one more benchmark for this particular case of partial results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will do that, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadikovi Could you enable it by default since there is no regression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will enable the flag by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have enabled the flag by default.

@sadikovi sadikovi changed the title [SPARK-41248] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646 [SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646 Nov 24, 2022
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gentle ping, @sadikovi .

@sadikovi
Copy link
Contributor Author

sadikovi commented Dec 11, 2022

Sorry, I was working on another issue. I will address the comments today/tomorrow.

@sadikovi
Copy link
Contributor Author

Benchmark results for when the config spark.sql.json.enablePartialResults is enabled and disabled (txt files).
JsonBenchmark results with config as false
JsonBenchmark results with config as true

It seems the results are fairly close but pushdown with filters appears to be faster when the config is enabled which is surprising. I reran that benchmark several times and confirm that the numbers reported are fairly accurate.

@sadikovi sadikovi requested review from MaxGekk and dongjoon-hyun and removed request for MaxGekk and dongjoon-hyun December 12, 2022 22:33
@MaxGekk MaxGekk changed the title [SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646 [SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results Dec 14, 2022
@MaxGekk
Copy link
Member

MaxGekk commented Dec 14, 2022

+1, LGTM. Merging to master.
Thank you, @sadikovi.

@MaxGekk MaxGekk closed this in 5b50834 Dec 14, 2022
@dongjoon-hyun
Copy link
Member

Thank you, @sadikovi and @MaxGekk !

@sadikovi
Copy link
Contributor Author

Awesome, thank you!

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…e/disable JSON partial results

### What changes were proposed in this pull request?

This PR adds a SQL config `spark.sql.json.enablePartialResults` to control SPARK-40646 change. This allows us to fall back to the behaviour before the change.

It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with Apache Spark JSON benchmarks (maybe we need to extend them, I can do it as a follow-up). Regardless, I propose to add a SQL config to have an ability to disable the change in case of performance degradation during JSON parsing.

Benchmark results are attached to the JIRA ticket.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

SQL config `spark.sql.json.enablePartialResults` is added to control the behaviour of SPARK-40646 JSON partial results parsing. Users can disable the feature if they find any performance regressions when reading JSON files.

### How was this patch tested?

I extended existing unit tests to test with flag enabled and disabled.

Closes apache#38784 from sadikovi/add-flag-json-parsing.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants