-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results #38784
Conversation
"when one or more fields do not match the schema") | ||
.version("3.4.0") | ||
.booleanConf | ||
.createWithDefault(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still debating whether to keep this as true
or false
. On one hand, when enabled, it fixes the correctness issue of partially parsing JSON records. When disabled, it could cause performance issues (impact is yet to be confirmed).
@MaxGekk I would love to know your thoughts on this. I would prefer to keep it enabled until the benchmark is produced that shows the regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you re-gen results of JsonBenchmark
. If there are no significant diffs, I am ok to enable it. Also, it would be nice to add one more benchmark for this particular case of partial results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will do that, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sadikovi Could you enable it by default since there is no regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will enable the flag by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I have enabled the flag by default.
Can one of the admins verify this patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gentle ping, @sadikovi .
Sorry, I was working on another issue. I will address the comments today/tomorrow. |
Benchmark results for when the config It seems the results are fairly close but pushdown with filters appears to be faster when the config is enabled which is surprising. I reran that benchmark several times and confirm that the numbers reported are fairly accurate. |
2e9dcd0
to
cf2d583
Compare
+1, LGTM. Merging to master. |
Awesome, thank you! |
…e/disable JSON partial results ### What changes were proposed in this pull request? This PR adds a SQL config `spark.sql.json.enablePartialResults` to control SPARK-40646 change. This allows us to fall back to the behaviour before the change. It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with Apache Spark JSON benchmarks (maybe we need to extend them, I can do it as a follow-up). Regardless, I propose to add a SQL config to have an ability to disable the change in case of performance degradation during JSON parsing. Benchmark results are attached to the JIRA ticket. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? SQL config `spark.sql.json.enablePartialResults` is added to control the behaviour of SPARK-40646 JSON partial results parsing. Users can disable the feature if they find any performance regressions when reading JSON files. ### How was this patch tested? I extended existing unit tests to test with flag enabled and disabled. Closes apache#38784 from sadikovi/add-flag-json-parsing. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
This PR adds a SQL config
spark.sql.json.enablePartialResults
to control SPARK-40646 change. This allows us to fall back to the behaviour before the change.It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with Apache Spark JSON benchmarks (maybe we need to extend them, I can do it as a follow-up). Regardless, I propose to add a SQL config to have an ability to disable the change in case of performance degradation during JSON parsing.
Benchmark results are attached to the JIRA ticket.
Why are the changes needed?
Does this PR introduce any user-facing change?
SQL config
spark.sql.json.enablePartialResults
is added to control the behaviour of SPARK-40646 JSON partial results parsing. Users can disable the feature if they find any performance regressions when reading JSON files.How was this patch tested?
I extended existing unit tests to test with flag enabled and disabled.