[SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page #32204
Closed
3316616
[SPARK-34494] Move Parquet data source options from Python and Scala …
itholic f4d9843
Resolved comments
itholic a7edd06
itemize the options
itholic e3bf606
Resolved comments
itholic b3843c5
Resolved comments
itholic b8b9dc8
remove unnecessary comment
itholic 26b7107
add (keyword argument)
itholic 2379a6d
Remove (keyword argument)
itholic a10586c
Fix wrong link
itholic
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -94,3 +94,168 @@ SELECT * FROM jsonTable | |
</div> | ||
|
||
</div> | ||
|
||
## Data Source Option | ||
|
||
Data source options of JSON can be set via: | ||
* the `.option`/`.options` methods of | ||
* `DataFrameReader` | ||
* `DataFrameWriter` | ||
* `DataStreamReader` | ||
* `DataStreamWriter` | ||
also mention:
|
||
|
||
<table class="table"> | ||
|
||
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr> | ||
<tr> | ||
<!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too. --> | ||
<td><code>timeZone</code></td> | ||
<td>None</td> | ||
<td>Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of <code>timeZone</code> are supported:<br> | ||
<ul> | ||
<li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li> | ||
<li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li> | ||
</ul> | ||
Other short names like 'CST' are not recommended because they can be ambiguous. If it isn't set, the current value of the SQL config <code>spark.sql.session.timeZone</code> is used by default. | |
</td> | ||
<td>read/write</td> | ||
</tr> | ||
<tr> | ||
<td><code>primitivesAsString</code></td> | ||
<td>None</td> | ||
<td>Infers all primitive values as a string type. If None is set, it uses the default value, <code>false</code>.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>prefersDecimal</code></td> | ||
<td>None</td> | ||
<td>Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. If None is set, it uses the default value, <code>false</code>.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>allowComments</code></td> | ||
<td>None</td> | ||
<td>Ignores Java/C++ style comments in JSON records. If None is set, it uses the default value, <code>false</code>.</td> | |
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>allowUnquotedFieldNames</code></td> | ||
<td>None</td> | ||
<td>Allows unquoted JSON field names. If None is set, it uses the default value, <code>false</code>.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>allowSingleQuotes</code></td> | ||
<td>None</td> | ||
<td>Allows single quotes in addition to double quotes. If None is set, it uses the default value, <code>true</code>.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>allowNumericLeadingZero</code></td> | ||
<td>None</td> | ||
<td>Allows leading zeros in numbers (e.g. 00012). If None is set, it uses the default value, <code>false</code>.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>allowBackslashEscapingAnyCharacter</code></td> | ||
<td>None</td> | ||
<td>Allows quoting of any character using the backslash quoting mechanism. If None is set, it uses the default value, <code>false</code>.</td> | |
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>mode</code></td> | ||
<td>None</td> | ||
<td>Allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, <code>PERMISSIVE</code>.<br> | |
<ul> | ||
<li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li> | |
<li><code>DROPMALFORMED</code>: ignores corrupted records entirely.</li> | |
<li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li> | ||
</ul> | ||
</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>columnNameOfCorruptRecord</code></td> | ||
<td>None</td> | ||
<td>Allows renaming the new field that holds the malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>. If None is set, it uses the value specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td> | |
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>dateFormat</code></td> | ||
<td>None</td> | ||
<td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td> | ||
<td>read/write</td> | ||
</tr> | ||
<tr> | ||
<td><code>timestampFormat</code></td> | ||
<td>None</td> | ||
<td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td> | ||
<td>read/write</td> | ||
</tr> | ||
<tr> | ||
<td><code>multiLine</code></td> | ||
<td>None</td> | ||
<td>Parse one record, which may span multiple lines, per file. If None is set, it uses the default value, <code>false</code>.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>allowUnquotedControlChars</code></td> | ||
<td>None</td> | ||
<td>Whether to allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters).</td> | |
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>encoding</code></td> | ||
<td>None</td> | ||
<td>For reading, allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. If None is set, the encoding of input JSON will be detected automatically when the multiLine option is set to <code>true</code>. For writing, specifies the encoding (charset) of saved JSON files. If None is set, the default UTF-8 charset will be used.</td> | |
Also fix the docs properly from |
||
<td>read/write</td> | ||
</tr> | ||
<tr> | ||
<td><code>lineSep</code></td> | ||
<td>None</td> | ||
<td>Defines the line separator that should be used for parsing. If None is set, it covers all <code>\r</code>, <code>\r\n</code> and <code>\n</code>.</td> | ||
<td>read/write</td> | ||
</tr> | ||
<tr> | ||
<td><code>samplingRatio</code></td> | ||
<td>None</td> | ||
<td>Defines fraction of input JSON objects used for schema inferring. If None is set, it uses the default value, <code>1.0</code>.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>dropFieldIfAllNull</code></td> | ||
<td>None</td> | ||
<td>Whether to ignore columns of all null values or empty arrays/structs during schema inference. If None is set, it uses the default value, <code>false</code>.</td> | |
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>locale</code></td> | ||
<td>None</td> | ||
<td>Sets a locale as language tag in IETF BCP 47 format. If None is set, it uses the default value, <code>en-US</code>. For instance, <code>locale</code> is used while parsing dates and timestamps.</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>allowNonNumericNumbers</code></td> | ||
<td>None</td> | ||
<td>Allows the JSON parser to recognize a set of “Not-a-Number” (NaN) tokens as legal floating-point number values. If None is set, it uses the default value, <code>true</code>.<br> | |
<ul> | ||
<li><code>+INF</code>: for positive infinity, along with its aliases <code>+Infinity</code> and <code>Infinity</code>.</li> | |
<li><code>-INF</code>: for negative infinity, along with its alias <code>-Infinity</code>.</li> | |
<li><code>NaN</code>: for other not-a-numbers, such as the result of division by zero.</li> | |
</ul> | ||
</td> | ||
<td>read</td> | ||
</tr> | ||
<tr> | ||
<td><code>compression</code></td> | ||
<td>None</td> | ||
<td>Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).</td> | |
<td>write</td> | ||
</tr> | ||
<tr> | ||
<td><code>ignoreNullFields</code></td> | ||
<td>None</td> | ||
<td>Whether to ignore null fields when generating JSON objects. If None is set, it uses the default value, <code>true</code>.</td> | ||
<td>write</td> | ||
</tr> | ||
</table> | ||
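
As a rough illustration of the two `timeZone` formats described in the table, the sketch below classifies a string as a region-based zone ID, a zone offset, or one of the `UTC`/`Z` aliases. It is not Spark's validation logic: the function name `classify_time_zone` and both regular expressions are assumptions made for this example, and the region pattern is a simplification that only checks the 'area/city' shape, not whether the zone actually exists.

```python
import re

# Hypothetical patterns for this sketch only; Spark's own parsing may differ.
_OFFSET = re.compile(r"[+-]\d{2}:\d{2}")                   # '(+|-)HH:mm'
_REGION = re.compile(r"[A-Za-z_]+(/[A-Za-z0-9_+-]+)+")     # 'area/city'


def classify_time_zone(value):
    """Return a label describing which documented format `value` matches."""
    if value in ("UTC", "Z"):
        # Documented aliases of '+00:00'.
        return "offset-alias"
    if _OFFSET.fullmatch(value):
        return "zone-offset"
    if _REGION.fullmatch(value):
        return "region-based"
    # Short names such as 'CST' land here; the docs discourage them.
    return "other"


for candidate in ["America/Los_Angeles", "-08:00", "UTC", "CST"]:
    print(candidate, "->", classify_time_zone(candidate))
```

A check like this only catches shape errors early; whether a given ID is accepted is still decided by Spark at read or write time.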
Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html"> Generic File Source Options</a>. |
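
To make the three `mode` values in the table more concrete, here is a small toy emulation over line-delimited JSON using only Python's standard `json` module. It is a sketch of the described behavior, not Spark's parser: the function `parse_json_lines` is hypothetical, the default column name `_corrupt_record` is assumed here as the value of `spark.sql.columnNameOfCorruptRecord`, and fields that Spark would set to `null` are simply left absent.

```python
import json


def parse_json_lines(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Toy emulation of PERMISSIVE / DROPMALFORMED / FAILFAST (not Spark code)."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                # Treat anything that is not a JSON object as malformed here.
                raise ValueError("not a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "FAILFAST":
                raise       # surface the first corrupt record as an error
            if mode == "DROPMALFORMED":
                continue    # skip the corrupt record
            # PERMISSIVE: keep the raw text in the corrupt-record column.
            rows.append({corrupt_column: line})
            continue
        rows.append(record)
    return rows


sample = ['{"a": 1}', '{"a": ', '{"a": 2}']
print(parse_json_lines(sample))                        # keeps the bad line as text
print(parse_json_lines(sample, mode="DROPMALFORMED"))  # drops the bad line
```

With a real `DataFrameReader`, the same choice is made by passing `mode` (and optionally `columnNameOfCorruptRecord`) through `.option` or `.options`.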
Can you add JSON functions here too