@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 28, 2020

What changes were proposed in this pull request?

Return the schema in SQL format instead of as a catalog string from the SchemaOfJson expression.

Why are the changes needed?

In some cases, from_json() cannot parse the schema returned by schema_of_json, for instance when a JSON field name contains spaces. After the changes, such field names are quoted, so from_json() can parse the schema.

Here is the example:

val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")

raises the exception:

== SQL ==
struct<a b:bigint>
------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
	at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
	at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
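
To illustrate why the parser trips over `struct<a b:bigint>`, here is a minimal, Spark-free sketch of the two renderings. The types `DType`, `BigIntT`, `StructT` and the object `SchemaRender` are hypothetical stand-ins invented for this example; they are not Spark's classes, and the backtick quoting shown is an assumption about what "quoted" means here rather than a copy of the patch.

```scala
// Hypothetical miniature schema model for illustration; NOT Spark's types.
sealed trait DType
case object BigIntT extends DType
final case class StructT(fields: List[(String, DType)]) extends DType

object SchemaRender {
  // Catalog-style rendering: lower case, field names emitted verbatim,
  // so a name such as "a b" ends up as two bare tokens.
  def catalogString(t: DType): String = t match {
    case BigIntT => "bigint"
    case StructT(fs) =>
      fs.map { case (n, d) => s"$n:${catalogString(d)}" }
        .mkString("struct<", ",", ">")
  }

  // Backtick-quote an identifier, doubling any embedded backtick
  // (assumed quoting convention).
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // SQL-style rendering: upper case type names, quoted field names.
  def sql(t: DType): String = t match {
    case BigIntT => "BIGINT"
    case StructT(fs) =>
      fs.map { case (n, d) => s"${quote(n)}: ${sql(d)}" }
        .mkString("STRUCT<", ", ", ">")
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val schema = StructT(List("a b" -> BigIntT))
    println(SchemaRender.catalogString(schema)) // struct<a b:bigint>
    println(SchemaRender.sql(schema))           // STRUCT<`a b`: BIGINT>
  }
}
```

In the first rendering the field name is indistinguishable from two separate tokens, which matches the position the parser error points at. In the second, the quoted name is a single identifier.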

Does this PR introduce any user-facing change?

Yes. For example, the output of schema_of_json for the input {"col":0} changes:

Before: struct<col:bigint>
After: STRUCT<col: BIGINT>

How was this patch tested?

By the existing test suites JsonFunctionsSuite and JsonExpressionsSuite.

@MaxGekk
Member Author

MaxGekk commented Oct 28, 2020

@HyukjinKwon @cloud-fan Could you take a look at this?

@HyukjinKwon
Member

+1 from me.

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34975/

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34975/

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130372 has finished for PR 30172 at commit df19069.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34980/

@GeekSheikh

This is fantastic! Thanks for the quick fix, @MaxGekk.

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34980/

@GeekSheikh

@MaxGekk, can we backport this change to Spark 2.4? It doesn't seem like it would need any adaptations.

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34982/

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34982/

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130377 has finished for PR 30172 at commit 718b10e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130379 has finished for PR 30172 at commit 349615d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@MaxGekk
Member Author

MaxGekk commented Oct 29, 2020

@HyukjinKwon @cloud-fan Here are similar changes for CSV: #30180

@MaxGekk MaxGekk deleted the schema_of_json-sql-schema branch December 11, 2020 20:28