
[SPARK-24709][SQL] schema_of_json() - schema inference from an example #21686

Closed
wants to merge 15 commits into from

Conversation


@MaxGekk MaxGekk commented Jul 1, 2018

What changes were proposed in this pull request?

In this PR, I propose to add a new function, schema_of_json(), which infers the schema of a JSON string literal. The result of the function is a string containing the schema in DDL format.

One of the use cases is combining schema_of_json() with from_json(). Currently, from_json() requires a schema as a mandatory argument. The schema_of_json() function allows pointing to a JSON string as an example that has the same schema as the first argument of from_json(). For instance:

```sql
select from_json(json_column, schema_of_json('{"c1": [0], "c2": [{"c3":0}]}'))
from json_table;
```
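To make the inference concrete, here is a minimal, illustrative Python sketch of how a JSON literal could be mapped to a DDL-format type string. This is not Spark's implementation; the name `infer_ddl` is hypothetical, and the mapping (JSON integers to bigint, arrays typed by their first element) only approximates what Spark's inference actually does:

```python
# Hypothetical sketch, not Spark's implementation: map a JSON literal to a
# DDL-format type string, roughly as schema_of_json does.
import json

def infer_type(value):
    if isinstance(value, bool):     # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "bigint"             # Spark infers JSON integers as bigint
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        elem = infer_type(value[0]) if value else "string"
        return f"array<{elem}>"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{infer_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    return "string"                 # nulls and anything else fall back to string

def infer_ddl(json_literal):
    return infer_type(json.loads(json_literal))

print(infer_ddl('{"c1": [0], "c2": [{"c3":0}]}'))
# struct<c1:array<bigint>,c2:array<struct<c3:bigint>>>
```

The resulting string is exactly the shape that from_json() can consume as its schema argument.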

How was this patch tested?

Added new tests to JsonFunctionsSuite and JsonExpressionsSuite, and SQL tests to json-functions.sql.

@SparkQA

SparkQA commented Jul 1, 2018

Test build #92510 has finished for PR 21686 at commit 2ff71e8.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2018

Test build #92511 has finished for PR 21686 at commit 56c925d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@rxin, does this look okay to you? If so, I will check closely and get this in.

@rxin
Contributor

rxin commented Jul 2, 2018

Does this actually work in SQL? How does it work when we don't have a data type that's a schema?

@MaxGekk
Member Author

MaxGekk commented Jul 2, 2018

Does this actually work in SQL?

Yes, it does. Please have a look at the SQL test:
https://github.com/apache/spark/pull/21686/files#diff-3b8a538abd658a260aa32c4aa593bed7R41

How does it work when we don't have a data type that's a schema?

We recently added support for schemas of any data type in DDL format: #21550. The new function uses the same mechanism we use for schema inference when reading files. So it can infer and return any data type (but not MapType for now) as a string in DDL format, and from_json() can accept it thanks to #21550. @rxin, if I didn't answer your question, please clarify what you mean.
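As a rough illustration of the consuming side, here is a hedged sketch of parsing such a DDL string back into a nested structure, loosely analogous to what DataType.fromDDL enables for from_json. This is not Spark's parser; `parse_ddl` is a hypothetical name, and it handles only struct, array, and primitive type names:

```python
# Hypothetical sketch only: a tiny recursive-descent parser for DDL type
# strings such as 'struct<a:bigint>' produced by schema_of_json.
# Not Spark's DataType.fromDDL; it ignores maps, quoting, and whitespace.

def parse_ddl(s):
    node, rest = _parse(s)
    if rest:
        raise ValueError(f"trailing input: {rest!r}")
    return node

def _parse(s):
    if s.startswith("struct<"):
        rest, fields = s[len("struct<"):], {}
        while True:
            name, rest = rest.split(":", 1)   # field names contain no ':'
            fields[name], rest = _parse(rest)
            if rest.startswith(","):
                rest = rest[1:]               # next field
            else:
                break
        return {"struct": fields}, rest[1:]   # drop the closing '>'
    if s.startswith("array<"):
        elem, rest = _parse(s[len("array<"):])
        return {"array": elem}, rest[1:]      # drop the closing '>'
    # primitive type name: read until ',' or '>'
    i = 0
    while i < len(s) and s[i] not in ",>":
        i += 1
    return s[:i], s[i:]

print(parse_ddl('struct<c1:array<bigint>,c2:array<struct<c3:bigint>>>'))
```

The point is only that the DDL string is a plain, parseable encoding, which is why it can travel as a string literal between schema_of_json and from_json.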

@rxin
Contributor

rxin commented Jul 3, 2018

Thanks. Awesome. This matches what I had in mind then.

```scala
extends UnaryExpression with String2StringExpression with CodegenFallback {

  private val jsonOptions = new JSONOptions(Map.empty, "UTC")
  private val jsonFactory = new JsonFactory()
```
Member

Seems jsonOptions.setJacksonOptions(factory) is missing.

Member Author

Thank you for pointing this out. I really didn't know that I had to call the method.

```scala
DataType.fromDDL(ddlSchema.toString)
case e => throw new AnalysisException(
  "Schema should be specified in DDL format as a string literal" +
  s" or output of the schema_of_json function instead of $e")
```
Member

minor nit: schema_of_json -> exp.prettyName

Member Author

exp contains a schema or something like that. Maybe you mean $e -> ${e.prettyName}?

```scala
Examples:
  > SELECT _FUNC_('[{"col":0}]');
   array<struct<col:int>>
""")
```
Member

since

```python
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(schema_of_json(df.value).alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
>>> df.select(schema_of_json(lit('''{"a": 0}''')).alias("json")).collect()
```
Member

minor nit '''{"a": 0}''' -> '{"a": 0}'

```scala
 * @since 2.4.0
 */
def from_json(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {
  withExpr {new JsonToStructs(e.expr, schema.expr, options.asScala.toMap)}
```
Member

@HyukjinKwon HyukjinKwon Jul 3, 2018

{n -> { n or withExpr(

Member Author

Maybe just withExpr(new JsonToStructs(...))?

Member

Yup

```scala
}

/**
 * (Scala-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
```
Member

Is it Scala specific or Java specific?

Member Author

I am not sure about this note at all. Why should it be Java or Scala specific?
I will change it to Java-specific to have it in the same style as other comments.

Member

Yup. That's because Java users will more likely use Java's collections rather than Scala's collections, which work weirdly on the Java side.

Member

Actually, can we remove this version for now and add it when it's requested? There is a concern about this file getting too long and from_json has so many variants.

Member Author

Maybe we will convert the functions object to a package object and put JSON-related methods in a separate file. This functions object is becoming really big.

@HyukjinKwon
Member

Seems fine to me otherwise.

@@ -2189,11 +2189,16 @@ def from_json(col, schema, options={}):

```python
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(from_json(df.value, schema).alias("json")).collect()
[Row(json=[Row(a=1)])]
>>> schema = schema_of_json(lit('''{"a": 0}'''))
```
Member

nit: '''{"a": 0}''' -> '{"a": 0}'

Member

feel free to fix other examples above too

Member Author

Do you mean for other functions too?

Member

Nope, I mean the examples here in this function.

```scala
 * @group collection_funcs
 * @since 2.4.0
 */
def from_json(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {
```
Member

Let me leave my last comment, #21686 (comment) in case it's missed.

Member Author

We call the method from Python: https://github.com/apache/spark/pull/21686/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2202

Do you really want to revert the changes for Python?

Member

Ah, I see. I am fine then. Thanks.

```python
:param col: string column in json format

>>> from pyspark.sql.types import *
>>> data = [(1, '''{"a": 1}''')]
```
Member

ditto

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM otherwise

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92575 has finished for PR 21686 at commit c993fd1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 4, 2018

Test build #92578 has finished for PR 21686 at commit 86f6886.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 4, 2018

Test build #92580 has finished for PR 21686 at commit dc35731.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 776f299 Jul 4, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes apache#21686 from MaxGekk/infer_schema_json.
@MaxGekk MaxGekk deleted the infer_schema_json branch August 17, 2019 13:34
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019