Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26310][SQL] Verify applicability of JSON options #23257

Closed
wants to merge 5 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Dec 7, 2018

What changes were proposed in this pull request?

In the PR, I propose additional verification of JSON options. In particular, detection of using read options in write and vice versa. The verification can be disabled by new SQL config spark.sql.verifyDataSourceOptions (true by default). This changes should prevent misusing of JSON options, and improve usage experience of built-in datasource.

How was this patch tested?

Added new test to JacksonGeneratorSuite to check how in read option is handled if it is passed in write. Also the test checks new SQL config can disable the verification.

@MaxGekk
Copy link
Member Author

MaxGekk commented Dec 7, 2018

@gatorsmile Please, have a look at this PR.

@SparkQA
Copy link

SparkQA commented Dec 8, 2018

Test build #99844 has finished for PR 23257 at commit af15070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Dec 8, 2018

@cloud-fan What do you think of the PR?

@SparkQA
Copy link

SparkQA commented Dec 9, 2018

Test build #99881 has finished for PR 23257 at commit 9ee62ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Dec 15, 2018

@gatorsmile Should I close the PR or will continue with this approach?

@SparkQA
Copy link

SparkQA commented Dec 16, 2018

Test build #100204 has finished for PR 23257 at commit 06a0df1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Dec 22, 2018

@gatorsmile Could you look at the PR, please.

@MaxGekk
Copy link
Member Author

MaxGekk commented Dec 23, 2018

@HyukjinKwon What do you think of this PR?

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't object to the change. My guts say this should be more generalised but on the other hand it looks an overkill. The configuration also looks an overkill - we could warn instead but on the other hand I get why it's needed.

Also, I'm not still sure about ..OptionsRead and ..OptionsWrite but I don't object as I said.

def checkOptions(where: String): Unit = {
val wrongOptions = notApplicableOptions.filter(parameters.contains(_))
if (!wrongOptions.isEmpty && SQLConf.get.verifyDataSourceOptions) {
// scalastyle:off throwerror
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this.

@@ -172,3 +189,34 @@ private[sql] object JSONOptionsInRead {
Charset.forName("UTF-32")
)
}

private[sql] class JSONOptionsInWrite(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't also need private[sql] since we're already in a private package.

val wrongOptions = notApplicableOptions.filter(parameters.contains(_))
if (!wrongOptions.isEmpty && SQLConf.get.verifyDataSourceOptions) {
// scalastyle:off throwerror
throw new IllegalArgumentException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also consider to warn when verifyDataSourceOptions is disabled.

@MaxGekk
Copy link
Member Author

MaxGekk commented Dec 24, 2018

@HyukjinKwon Thank you for looking at this PR. I proposed the changes because I saw a few times already when our users tried to apply datasource specific options in wrong way. Maybe this is because of Spark's documentation is not clear enough, and we should point out more precisely when particular option can be applicable. ... but I believe it makes sense to solve the problem in code.

... more generalised but on the other hand it looks an overkill

I considered more generalized approach like creating a trait for JSONOptions, CSVOptions, TextOptions, AvroOptions and etc. like. The trait could contains applicable options for each direction (in read/in write). In this way, a caller could ask explicitly (or implicitly from the constructor) to verify passed options.

The configuration also looks an overkill ...

It was introduced only because of behavior change.

... we could warn instead but on the other hand I get why it's needed

Logs are not always available to users or can be disabled. And who among users read logs ;-) . I prefer to throw an exception because it can safe user's time in debugging/trouble shooting. I cannot imagine a situation when in write options are passed as read options in purpose, and silent behavior could be expected.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 6, 2019

@gatorsmile Could you look at this PR, please.

@HyukjinKwon
Copy link
Member

Im okie. Let me leave it to @gatorsmile and other committees.

@@ -1635,6 +1635,14 @@ object SQLConf {
"java.time.* packages are used for the same purpose.")
.booleanConf
.createWithDefault(false)

val VERIFY_DATASOURCE_OPTIONS = buildConf("spark.sql.verifyDataSourceOptions")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a config for this? seems more like a bug fix, even if it's a behavior change, and we can be stricter about validation in Spark 3.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the config but @gatorsmile asked to add it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can't say I feel too strongly about it, but when would this be set to false @gatorsmile ? this is a case where the user is already setting useless options right, so they can a) somehow discover this flag or b) just remove the options. b) seems much more likely to happen. Is the worry about legacy code or libraries that can't be easily modified? those already may not work on Spark 3.

@SparkQA
Copy link

SparkQA commented Jan 11, 2019

Test build #101099 has finished for PR 23257 at commit e1db3ed.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Jan 11, 2019

I am closing this since I don't see enthusiasm about the changes.

@MaxGekk MaxGekk closed this Jan 11, 2019
@MaxGekk MaxGekk deleted the sep-options-in-save-load branch August 17, 2019 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants