[SPARK-30334][SQL] Introduce as_json for marking a column as JSON data #26987

brkyvz · 2019-12-23T17:25:20Z

What changes were proposed in this pull request?

Semi-structured data is widely used in the data industry to report events in a variety of formats. Click events in product analytics can be stored as JSON, some application logs take the form of delimited key=value text, and other data may be in XML.

The goal of this project is to let users signal to Spark that such a column exists, which in turn enables Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the following fields:

format: The format of the semi-structured column, e.g. json, xml, avro
options: Options for parsing these columns

This PR introduces the function "as_json", which accomplishes this for JSON columns.
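To make the metadata idea concrete, here is a minimal sketch of attaching "format" and "options" fields to a column using Spark's existing MetadataBuilder and Column.as(alias, metadata) APIs. It is not the implementation in this PR: the helper name semiStructuredMetadata, the object name, the sample "payload" column, and the "mode" option are assumptions made for illustration, and as_json itself is not used because its signature is not shown here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

object SemiStructuredMetadataSketch {

  // Hypothetical helper (not part of this PR): packs the two proposed
  // fields, "format" and "options", into a column Metadata object.
  def semiStructuredMetadata(format: String, options: Map[String, String]): Metadata = {
    val optionsBuilder = new MetadataBuilder()
    options.foreach { case (key, value) => optionsBuilder.putString(key, value) }
    new MetadataBuilder()
      .putString("format", format)
      .putMetadata("options", optionsBuilder.build())
      .build()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("semi-structured-metadata-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: a single string column holding raw JSON text.
    val events = Seq("""{"user":"a","clicks":3}""").toDF("payload")

    // Re-alias the column with metadata that marks it as JSON.
    val meta = semiStructuredMetadata("json", Map("mode" -> "PERMISSIVE"))
    val tagged = events.select(col("payload").as("payload", meta))

    // The marker travels with the schema, so later stages can inspect it.
    val field = tagged.schema("payload")
    println(field.metadata.getString("format"))
    println(field.metadata.getMetadata("options").getString("mode"))

    spark.stop()
  }
}
```

Keeping the marker in the schema rather than in the data means the column stays a plain string until something chooses to parse it, which is presumably what allows parsing "on the fly".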

Why are the changes needed?

Simplify the handling of semi-structured columns in Spark, initially for JSON data.

Does this PR introduce any user-facing change?

Yes. This PR introduces a new function called as_json.

How was this patch tested?

Unit tests

SparkQA · 2019-12-23T23:16:20Z

Test build #115643 has finished for PR 26987 at commit 576af91.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AsJson(json: Expression, formatOptions: Map[String, String], child: Expression)

brkyvz · 2020-01-07T23:31:42Z

cc @marmbrus

github-actions · 2020-08-10T00:36:47Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

brkyvz added 3 commits December 22, 2019 16:44

Introduce SemiStructuredColumn

0b52f9c

add tests

9c720c6

tests pass

576af91

dongjoon-hyun added the SQL label Feb 5, 2020

github-actions bot added the Stale label Aug 10, 2020

github-actions bot closed this Aug 11, 2020

