
[SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv Spark SQL and PySpark#30201

Closed
MaxGekk wants to merge 11 commits into apache:master from MaxGekk:from_json-common-schema-parsing

Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 30, 2020

What changes were proposed in this pull request?

Move schema parsing out of from_json in the Scala API into the common function ExprUtils.evalTypeExpr, which is used by the JsonToStructs and CsvToStructs expressions, and consequently by the SQL API, for instance.
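The parsing strategy described above can be sketched in plain Python (the helper names below are hypothetical stand-ins; Spark's actual implementation is in Scala and uses DataType.fromDDL with a fallback parser):

```python
import json

def parse_ddl(schema):
    # Toy DDL parser: accepts only "name TYPE, name TYPE" pairs,
    # e.g. "a INT, b STRING". Real Spark DDL parsing is far richer.
    fields = []
    for part in schema.split(","):
        tokens = part.split()
        if len(tokens) != 2:
            raise ValueError(f"not a DDL schema: {schema!r}")
        name, typ = tokens
        fields.append({"name": name, "type": typ.lower()})
    return {"type": "struct", "fields": fields}

def parse_schema_string(schema):
    """Hypothetical stand-in for the unified schema-string parsing:
    try the DDL form first, then fall back to the JSON form."""
    try:
        return parse_ddl(schema)
    except ValueError:
        # Fall back to the JSON type representation,
        # e.g. '{"type": "map", "keyType": "string", ...}'
        return json.loads(schema)
```

The design choice mirrored here is that the DDL form stays the primary, user-facing format, while the JSON form is accepted as a fallback for schemas that were serialized programmatically.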

Why are the changes needed?

Currently, from_json behaves differently in the Scala API than in PySpark/SQL: only the Scala API accepts a schema in JSON format. To improve the user experience with Spark SQL and PySpark, this PR proposes to unify the behavior of from_json and from_csv.

Does this PR introduce any user-facing change?

Yes.

Before (in Spark 3.0)

spark-sql> select from_json('{"a":1}', '{"type" : "map", "keyType" : "string", "valueType" : "integer", "valueContainsNull" : true}');
Error in query:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"type" : "map", "keyType" : "string", "valueType" : "integer", "valueContainsNull" : true}
^^^

After:

spark-sql> select from_json('{"a":1}', '{"type" : "map", "keyType" : "string", "valueType" : "integer", "valueContainsNull" : true}');
{"a":1}
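The JSON schema string in the example above follows the representation that Spark's DataType.json() produces. A small sketch of building such a string programmatically (the exact field names are taken from the example, not invented here):

```python
import json

# A MAP<STRING, INT> type in Spark's JSON type representation,
# matching the schema string used in the from_json example above.
map_schema = {
    "type": "map",
    "keyType": "string",
    "valueType": "integer",
    "valueContainsNull": True,
}
schema_str = json.dumps(map_schema)
# After this PR, a string like schema_str could be passed as the
# second argument of from_json in SQL or PySpark, not just in Scala.
```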

How was this patch tested?

Added test cases to csv-functions.sql and json-functions.sql.

@MaxGekk MaxGekk changed the title [SPARK-33299][SQL] Support schema in JSON format in from_json/from_csv in all APIs [SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv across all APIs Oct 30, 2020
@MaxGekk
Member Author

MaxGekk commented Oct 30, 2020

@HyukjinKwon @maropu Please, review this PR.

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35054/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35055/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35054/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35055/

@zero323
Member

zero323 commented Oct 30, 2020

We should probably update SparkR (in collection functions doc) and PySpark docs.

Review comment on this code context (diff excerpts):

    fallbackParser = DataType.fromDDL)
    from_json(e, dataType, options)
    def from_json(e: Column, schema: String, options: Map[String, String]): Column = withExpr {
      new JsonToStructs(e.expr, lit(schema).expr, options)

Could you update ExpressionDescription (usage and examples) of CsvToStructs and JsonToStructs?

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130449 has finished for PR 30201 at commit 44c869c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Oct 30, 2020

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130462 has finished for PR 30201 at commit 780834f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35067/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35067/

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130465 has finished for PR 30201 at commit d91597b.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Oct 30, 2020

I guess SparkR handles schema strings specially. It seems it doesn't support schemas in JSON format. @HyukjinKwon, correct?

jschema <- structType(schema)$jobj

I am going to revert my comments in SparkR.

@MaxGekk MaxGekk changed the title [SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv across all APIs [SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv Spark SQL and PySpark Oct 30, 2020
@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35071/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35071/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35075/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35075/

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130470 has finished for PR 30201 at commit 88cd2a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member

zero323 commented Oct 31, 2020

@HyukjinKwon already updated the Python docs :-)

I was thinking more about an example of using JSON input for the schema, like the one you've added in
5820120.

But it doesn't seem like there's any user-visible change in Python behavior here after all, is there?

I guess SparkR handles schema strings specially.

Yes, my bad.

@HyukjinKwon
Member

Thanks @MaxGekk for catching this. Will take a look tomorrow.

@HyukjinKwon
Member

Okay, I have thought about it for a while and had an offline discussion with @MaxGekk. It looks like we should just fix the doc in PySpark, and not support the JSON format in PySpark, SQL, and SparkR for now. We can keep it on the Scala side for legacy reasons.

The reason is that I think the JSON format is sort of internal, and we should promote DDL-formatted strings instead. I haven't heard any complaints about JSON string support so far (surprisingly).

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 2, 2020

Please review #30226

@MaxGekk MaxGekk deleted the from_json-common-schema-parsing branch December 11, 2020 20:28