[SPARK-30334][SQL] Introduce as_json for marking a column as JSON data #26987

Closed
wants to merge 3 commits into from

brkyvz
Contributor

@brkyvz brkyvz commented Dec 23, 2019

What changes were proposed in this pull request?

Semi-structured data is widely used in the data industry to report events in a variety of formats. Click events in product analytics can be stored as JSON, some application logs take the form of delimited key=value text, and other data may be in XML.

The goal of this project is to let users signal to Spark that a column contains such data, which would then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the following fields:

  • format: The format of the semi-structured column, e.g. json, xml, avro
  • options: Options for parsing these columns

This PR introduces the function "as_json", which accomplishes this for JSON columns.
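
As a rough illustration of the idea (not the code in this PR, and not Spark's API), the sketch below models a column as plain Python data and shows how "format" and "options" entries in the column metadata could drive parsing at read time. The names Column, PARSERS, auto_parse and the option key "example_option" are hypothetical and exist only for this example; only the function name as_json and the two metadata fields come from the description above.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column: raw string values plus metadata."""
    name: str
    values: List[str]
    metadata: Dict[str, Any] = field(default_factory=dict)


def as_json(column: Column, options: Optional[Dict[str, str]] = None) -> Column:
    """Tag a column as JSON. The raw values are not parsed or modified here;
    only the metadata fields "format" and "options" are recorded."""
    meta = dict(column.metadata)
    meta["format"] = "json"
    meta["options"] = dict(options or {})
    return Column(column.name, column.values, meta)


# Hypothetical registry of parsers keyed by the "format" metadata field.
# The toy JSON parser receives the options but does not interpret them.
PARSERS = {"json": lambda raw, options: json.loads(raw)}


def auto_parse(column: Column) -> List[Any]:
    """Parse values on the fly if the column is tagged with a known format;
    untagged columns are returned as raw strings."""
    fmt = column.metadata.get("format")
    if fmt is None:
        return list(column.values)
    parser = PARSERS[fmt]
    options = column.metadata.get("options", {})
    return [parser(raw, options) for raw in column.values]


events = Column("event", ['{"type": "click", "x": 10}', '{"type": "scroll", "x": 0}'])
tagged = as_json(events, {"example_option": "value"})
parsed = auto_parse(tagged)
print(parsed[0]["type"], parsed[1]["type"])
```

Marking a column is kept separate from parsing it, so the stored strings stay unchanged and parsing is deferred until the column is actually read.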

Why are the changes needed?

Simplify the handling of semi-structured columns in Spark, initially for JSON data.

Does this PR introduce any user-facing change?

Yes. It introduces a new function called "as_json".

How was this patch tested?

Unit tests

@SparkQA

SparkQA commented Dec 23, 2019

Test build #115643 has finished for PR 26987 at commit 576af91.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class AsJson(json: Expression, formatOptions: Map[String, String], child: Expression)

@brkyvz
Contributor Author

brkyvz commented Jan 7, 2020

cc @marmbrus

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 10, 2020
@github-actions github-actions bot closed this Aug 11, 2020
3 participants