[SPARK-11745][SQL] Enable more JSON parsing options #9724

Closed
wants to merge 3 commits into from

rxin
Contributor

@rxin rxin commented Nov 16, 2015

This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:

  • allowComments (default false): ignores Java/C++ style comments in JSON records
  • allowUnquotedFieldNames (default false): allows unquoted JSON field names
  • allowSingleQuotes (default true): allows single quotes in addition to double quotes
  • allowNumericLeadingZeros (default false): allows leading zeros in numbers (e.g. 00012)
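
To make the four cases concrete, here is a minimal sketch of one non-standard record per option. It uses Python's standard `json` module as a stand-in for a strict parser, not Spark's own JSON reader, and the names `SAMPLES` and `is_strict_json` are made up for illustration. Each sample is the kind of input that the matching option is meant to tolerate.

```python
import json

# One sample record per option; each departs from standard JSON in exactly one way.
# The sample strings are illustrative, not taken from the patch or its tests.
SAMPLES = {
    "allowComments": '{"a": 1 /* note */}',
    "allowUnquotedFieldNames": '{a: 1}',
    "allowSingleQuotes": "{'a': 1}",
    "allowNumericLeadingZeros": '{"a": 00012}',
}


def is_strict_json(text: str) -> bool:
    """Return True if Python's strict json parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Map each option name to whether a strict parser accepts its sample record.
results = {option: is_strict_json(text) for option, text in SAMPLES.items()}

if __name__ == "__main__":
    for option, accepted in results.items():
        print(f"{option}: accepted by strict parser = {accepted}")
```

A strict parser rejects all four samples. Per the defaults listed above, the JSON data source would accept the single-quoted record without any configuration and would need the corresponding option set to `true` for the other three.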

To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options.
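
The idea of bundling the options into one value can be sketched as follows. This is a hypothetical Python analogue, not the Scala `JSONOptions` class from the patch: the class name `JsonOptions`, its field names, and the `from_config_map` helper are assumptions for illustration. Only the option keys and defaults come from the list above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class JsonOptions:
    """Immutable bundle of parser flags, passed around as a single argument."""

    allow_comments: bool = False
    allow_unquoted_field_names: bool = False
    allow_single_quotes: bool = True
    allow_numeric_leading_zeros: bool = False

    @classmethod
    def from_config_map(cls, parameters: Mapping[str, str]) -> "JsonOptions":
        """Build the options from a string-to-string map, keeping defaults for missing keys."""

        def flag(key: str, default: bool) -> bool:
            raw = parameters.get(key)
            # Treat only the string "true" (case-insensitive) as True.
            return default if raw is None else raw.strip().lower() == "true"

        return cls(
            allow_comments=flag("allowComments", False),
            allow_unquoted_field_names=flag("allowUnquotedFieldNames", False),
            allow_single_quotes=flag("allowSingleQuotes", True),
            allow_numeric_leading_zeros=flag("allowNumericLeadingZeros", False),
        )


if __name__ == "__main__":
    # Only the overridden key changes; the rest keep their defaults.
    print(JsonOptions.from_config_map({"allowComments": "true"}))
```

With a single options object, adding another flag later means touching the class and the code that reads the flag, rather than every function signature in between.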

Also updated documentation to explain these options.

Scala

screen shot 2015-11-15 at 6 12 12 pm

Python

screen shot 2015-11-15 at 6 11 28 pm

@@ -221,22 +221,6 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ

private[this] def isTesting: Boolean = sys.props.contains("spark.testing")

protected def newProjection(
Contributor Author

this is now unused.

* <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* </li>
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
Contributor

Add samplingRatio?

Contributor Author

I think we skipped it in the past because it had very little impact on performance, so in most cases it is better to just use 1.0... Maybe we should even deprecate that option.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #2061 has finished for PR 9724 at commit 00cfc19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #45972 has finished for PR 9724 at commit 00cfc19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor Author

rxin commented Nov 16, 2015

Alright I've updated it.

@yhuai
Contributor

yhuai commented Nov 16, 2015

LGTM pending jenkins.

@SparkQA

SparkQA commented Nov 16, 2015

Test build #45981 has finished for PR 9724 at commit d8ca56d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class JSONOptions(

@rxin
Contributor Author

rxin commented Nov 16, 2015

Thanks - I'm merging this in.

asfgit pushed a commit that referenced this pull request Nov 16, 2015
Scala

![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png)

Python

![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png)

Author: Reynold Xin <rxin@databricks.com>

Closes #9724 from rxin/SPARK-11745.

(cherry picked from commit 42de525)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 42de525 Nov 16, 2015
3 participants