
[SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv Spark SQL and PySpark#30201

Closed
MaxGekk wants to merge 11 commits into apache:master from MaxGekk:from_json-common-schema-parsing

Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 30, 2020

What changes were proposed in this pull request?

Move schema parsing out of from_json in the Scala API into the common function ExprUtils.evalTypeExpr, which is used by the JsonToStructs and CsvToStructs expressions, and consequently by the SQL API, for instance.
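The parsing strategy described above can be sketched in plain Python (the helper names below are hypothetical stand-ins; Spark's actual implementation is in Scala and uses DataType.fromDDL with a fallback parser):

```python
import json

def parse_ddl(schema):
    # Toy DDL parser: accepts only "name TYPE, name TYPE" pairs,
    # e.g. "a INT, b STRING". Real Spark DDL parsing is far richer.
    fields = []
    for part in schema.split(","):
        tokens = part.split()
        if len(tokens) != 2:
            raise ValueError(f"not a DDL schema: {schema!r}")
        name, typ = tokens
        fields.append({"name": name, "type": typ.lower()})
    return {"type": "struct", "fields": fields}

def parse_schema_string(schema):
    """Hypothetical stand-in for the unified schema-string parsing:
    try the DDL form first, then fall back to the JSON form."""
    try:
        return parse_ddl(schema)
    except ValueError:
        # Fall back to the JSON type representation,
        # e.g. '{"type": "map", "keyType": "string", ...}'
        return json.loads(schema)
```

The design choice mirrored here is that the DDL form stays the primary, user-facing format, while the JSON form is accepted as a fallback for schemas that were serialized programmatically.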

Why are the changes needed?

Currently, from_json behaves differently in the Scala API than in PySpark/SQL: only the Scala API accepts a schema in JSON format. To improve the user experience with Spark SQL and PySpark, this PR proposes to unify the behavior of from_json and from_csv.

Does this PR introduce any user-facing change?

Yes.

Before (in Spark 3.0)

spark-sql> select from_json('{"a":1}', '{"type" : "map", "keyType" : "string", "valueType" : "integer", "valueContainsNull" : true}');
Error in query:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"type" : "map", "keyType" : "string", "valueType" : "integer", "valueContainsNull" : true}
^^^

After:

spark-sql> select from_json('{"a":1}', '{"type" : "map", "keyType" : "string", "valueType" : "integer", "valueContainsNull" : true}');
{"a":1}
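The JSON schema string in the example above follows the representation that Spark's DataType.json() produces. A small sketch of building such a string programmatically (the exact field names are taken from the example, not invented here):

```python
import json

# A MAP<STRING, INT> type in Spark's JSON type representation,
# matching the schema string used in the from_json example above.
map_schema = {
    "type": "map",
    "keyType": "string",
    "valueType": "integer",
    "valueContainsNull": True,
}
schema_str = json.dumps(map_schema)
# After this PR, a string like schema_str could be passed as the
# second argument of from_json in SQL or PySpark, not just in Scala.
```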

How was this patch tested?

Added test cases to csv-functions.sql and json-functions.sql.

@MaxGekk MaxGekk changed the title [SPARK-33299][SQL] Support schema in JSON format in from_json/from_csv in all APIs [SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv across all APIs Oct 30, 2020
@MaxGekk
Member Author

MaxGekk commented Oct 30, 2020

@HyukjinKwon @maropu Please, review this PR.

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35054/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35055/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35054/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35055/

@zero323
Member

zero323 commented Oct 30, 2020

We should probably update SparkR (in collection functions doc) and PySpark docs.

Review comment on this code context (diff excerpts):

    fallbackParser = DataType.fromDDL)
    from_json(e, dataType, options)
    def from_json(e: Column, schema: String, options: Map[String, String]): Column = withExpr {
      new JsonToStructs(e.expr, lit(schema).expr, options)

Could you update ExpressionDescription (usage and examples) of CsvToStructs and JsonToStructs?

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130449 has finished for PR 30201 at commit 44c869c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Oct 30, 2020

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130462 has finished for PR 30201 at commit 780834f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35067/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35067/

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130465 has finished for PR 30201 at commit d91597b.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Oct 30, 2020

I guess SparkR handles schema strings specially. It seems it doesn't support schemas in JSON format. @HyukjinKwon, correct?

jschema <- structType(schema)$jobj

I am going to revert my comments in SparkR.

@MaxGekk MaxGekk changed the title [SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv across all APIs [SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv Spark SQL and PySpark Oct 30, 2020
@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35071/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35071/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35075/

@SparkQA

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35075/

@SparkQA

SparkQA commented Oct 30, 2020

Test build #130470 has finished for PR 30201 at commit 88cd2a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member

zero323 commented Oct 31, 2020

@HyukjinKwon already updated the Python docs :-)

I was thinking more about an example of using JSON input for the schema, like the one you've added in
5820120.

But it doesn't seem like there's any user-visible change in Python behavior here after all, is there?

I guess SparkR handles schema strings specially.

Yes, my bad.

@HyukjinKwon
Member

Thanks @MaxGekk for catching this. Will take a look tomorrow.

@HyukjinKwon
Member

Okay, I have thought about it for a while and had an offline discussion with @MaxGekk. It looks like we should just fix the doc in PySpark, and not support the JSON format in PySpark, SQL, and SparkR for now. We can keep it on the Scala side for legacy reasons.

The reason is that I think the JSON format is sort of internal, and we should promote DDL-formatted strings instead. I haven't heard any complaints about JSON string support so far (surprisingly).

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 2, 2020

Please review #30226

@MaxGekk MaxGekk deleted the from_json-common-schema-parsing branch December 11, 2020 20:28