
[SPARK-24709][SQL] schema_of_json() - schema inference from an example #21686

Closed
wants to merge 15 commits into from

Conversation


@MaxGekk MaxGekk commented Jul 1, 2018

What changes were proposed in this pull request?

In this PR, I propose to add a new function, schema_of_json(), which infers the schema of a JSON string literal. The result of the function is a string containing the schema in DDL format.

One of the use cases is combining schema_of_json() with from_json(). Currently, from_json() requires a schema as a mandatory argument. The schema_of_json() function allows pointing to a JSON string as an example that has the same schema as the first argument of from_json(). For instance:

```sql
select from_json(json_column, schema_of_json('{"c1": [0], "c2": [{"c3":0}]}'))
from json_table;
```
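To make the inference concrete, here is a minimal, illustrative Python sketch of how a JSON literal could be mapped to a DDL-format type string. This is not Spark's implementation; the name `infer_ddl` is hypothetical, and the mapping (JSON integers to bigint, arrays typed by their first element) only approximates what Spark's inference actually does:

```python
# Hypothetical sketch, not Spark's implementation: map a JSON literal to a
# DDL-format type string, roughly as schema_of_json does.
import json

def infer_type(value):
    if isinstance(value, bool):     # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "bigint"             # Spark infers JSON integers as bigint
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        elem = infer_type(value[0]) if value else "string"
        return f"array<{elem}>"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{infer_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    return "string"                 # nulls and anything else fall back to string

def infer_ddl(json_literal):
    return infer_type(json.loads(json_literal))

print(infer_ddl('{"c1": [0], "c2": [{"c3":0}]}'))
# struct<c1:array<bigint>,c2:array<struct<c3:bigint>>>
```

The resulting string is exactly the shape that from_json() can consume as its schema argument.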

How was this patch tested?

Added new tests to JsonFunctionsSuite and JsonExpressionsSuite, and SQL tests to json-functions.sql.

@SparkQA

SparkQA commented Jul 1, 2018

Test build #92510 has finished for PR 21686 at commit 2ff71e8.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2018

Test build #92511 has finished for PR 21686 at commit 56c925d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@rxin, does this look okay to you? If so, I will check closely and get this in.

@rxin
Contributor

rxin commented Jul 2, 2018

Does this actually work in SQL? How does it work when we don't have a data type that's a schema?

@MaxGekk
Member Author

MaxGekk commented Jul 2, 2018

Does this actually work in SQL?

Yes, it does. Please have a look at the SQL test:
https://github.com/apache/spark/pull/21686/files#diff-3b8a538abd658a260aa32c4aa593bed7R41

How does it work when we don't have a data type that's a schema?

We recently added support for schemas of any data type in DDL format: #21550. The new function uses the same mechanism we use for schema inference when reading files. So it can infer and return any data type (but not MapType for now) as a string in DDL format, and from_json() can accept it thanks to #21550. @rxin, if I didn't answer your question, please clarify what you mean.
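As a rough illustration of the consuming side, here is a hedged sketch of parsing such a DDL string back into a nested structure, loosely analogous to what DataType.fromDDL enables for from_json. This is not Spark's parser; `parse_ddl` is a hypothetical name, and it handles only struct, array, and primitive type names:

```python
# Hypothetical sketch only: a tiny recursive-descent parser for DDL type
# strings such as 'struct<a:bigint>' produced by schema_of_json.
# Not Spark's DataType.fromDDL; it ignores maps, quoting, and whitespace.

def parse_ddl(s):
    node, rest = _parse(s)
    if rest:
        raise ValueError(f"trailing input: {rest!r}")
    return node

def _parse(s):
    if s.startswith("struct<"):
        rest, fields = s[len("struct<"):], {}
        while True:
            name, rest = rest.split(":", 1)   # field names contain no ':'
            fields[name], rest = _parse(rest)
            if rest.startswith(","):
                rest = rest[1:]               # next field
            else:
                break
        return {"struct": fields}, rest[1:]   # drop the closing '>'
    if s.startswith("array<"):
        elem, rest = _parse(s[len("array<"):])
        return {"array": elem}, rest[1:]      # drop the closing '>'
    # primitive type name: read until ',' or '>'
    i = 0
    while i < len(s) and s[i] not in ",>":
        i += 1
    return s[:i], s[i:]

print(parse_ddl('struct<c1:array<bigint>,c2:array<struct<c3:bigint>>>'))
```

The point is only that the DDL string is a plain, parseable encoding, which is why it can travel as a string literal between schema_of_json and from_json.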

@rxin
Contributor

rxin commented Jul 3, 2018

Thanks. Awesome. This matches what I had in mind then.

```scala
extends UnaryExpression with String2StringExpression with CodegenFallback {

  private val jsonOptions = new JSONOptions(Map.empty, "UTC")
  private val jsonFactory = new JsonFactory()
```
Member

Seems jsonOptions.setJacksonOptions(factory) is missing.

Member Author

Thank you for pointing this out. I really didn't know that I had to call the method.

```scala
DataType.fromDDL(ddlSchema.toString)
case e => throw new AnalysisException(
  "Schema should be specified in DDL format as a string literal" +
  s" or output of the schema_of_json function instead of $e")
```
Member

minor nit: schema_of_json -> exp.prettyName

Member Author

exp contains a schema or something like that. Maybe you mean $e -> ${e.prettyName}?

```scala
Examples:
  > SELECT _FUNC_('[{"col":0}]');
   array<struct<col:int>>
""")
```
Member

since

```python
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(schema_of_json(df.value).alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
>>> df.select(schema_of_json(lit('''{"a": 0}''')).alias("json")).collect()
```
Member

minor nit '''{"a": 0}''' -> '{"a": 0}'

```scala
 * @since 2.4.0
 */
def from_json(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {
  withExpr {new JsonToStructs(e.expr, schema.expr, options.asScala.toMap)}
```
Member

@HyukjinKwon HyukjinKwon Jul 3, 2018

{n -> { n or withExpr(

Member Author

Maybe just withExpr(new JsonToStructs(...))?

Member

Yup

```scala
}

/**
 * (Scala-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
```
Member

Is it Scala specific or Java specific?

Member Author

I am not sure about this note at all. Why should it be Java or Scala specific?
I will change it to Java-specific to have it in the same style as other comments.

Member

Yup. That's because Java users will more likely use Java's collections rather than Scala's collections, which work weirdly on the Java side.

Member

Actually, can we remove this version for now and add it when it's requested? There is a concern about this file getting too long and from_json has so many variants.

Member Author

Maybe we will convert the functions object to a package object and put JSON-related methods in a separate file. This functions object is becoming really big.

@HyukjinKwon
Member

Seems fine to me otherwise.

@@ -2189,11 +2189,16 @@ def from_json(col, schema, options={}):

```python
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(from_json(df.value, schema).alias("json")).collect()
[Row(json=[Row(a=1)])]
>>> schema = schema_of_json(lit('''{"a": 0}'''))
```
Member

nit: '''{"a": 0}''' -> '{"a": 0}'

Member

feel free to fix other examples above too

Member Author

Do you mean for other functions too?

Member

Nope, I mean the examples here in this function.

```scala
 * @group collection_funcs
 * @since 2.4.0
 */
def from_json(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {
```
Member

Let me leave my last comment, #21686 (comment) in case it's missed.

Member Author

We call the method from Python: https://github.com/apache/spark/pull/21686/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2202

Do you really want to revert the changes for Python?

Member

Ah, I see. I am fine then. Thanks.

```python
:param col: string column in json format

>>> from pyspark.sql.types import *
>>> data = [(1, '''{"a": 1}''')]
```
Member

ditto

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM otherwise

@SparkQA

SparkQA commented Jul 3, 2018

Test build #92575 has finished for PR 21686 at commit c993fd1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 4, 2018

Test build #92578 has finished for PR 21686 at commit 86f6886.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 4, 2018

Test build #92580 has finished for PR 21686 at commit dc35731.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 776f299 Jul 4, 2018
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes apache#21686 from MaxGekk/infer_schema_json.
@MaxGekk MaxGekk deleted the infer_schema_json branch August 17, 2019 13:34
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019