[SPARK-24709][SQL] schema_of_json() - schema inference from an example #21686
Conversation
Test build #92510 has finished for PR 21686 at commit

Test build #92511 has finished for PR 21686 at commit
@rxin, does this look okay to you? If so, I will check closely and get this in.

Does this actually work in SQL? How does it work when we don't have a data type that's a schema?

Yes, it does. Please have a look at the SQL test.
We recently supported schemas as any data type in DDL format: #21550. The new function uses the same mechanism as we use for schema inference when we read files, so it can infer and return any data type (but not …).

Thanks. Awesome. This matches what I had in mind then.
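The inference mechanism discussed above can be illustrated with a toy sketch in plain Python. This is not Spark's actual Jackson-based implementation; the type mapping below is an assumption for illustration (Spark does infer JSON integers as `bigint`, as the doctests in this PR show).

```python
import json

def schema_of_json(example):
    """Toy re-implementation of the idea: infer a DDL-like type from one JSON example."""
    def infer(value):
        if isinstance(value, bool):  # check bool before int: bool is a subclass of int
            return "boolean"
        if isinstance(value, int):
            return "bigint"          # Spark infers JSON integers as bigint
        if isinstance(value, float):
            return "double"
        if isinstance(value, list):
            # assume a homogeneous array; infer the element type from the first element
            return "array<" + (infer(value[0]) if value else "string") + ">"
        if isinstance(value, dict):
            fields = ",".join(k + ":" + infer(v) for k, v in value.items())
            return "struct<" + fields + ">"
        return "string"
    return infer(json.loads(example))

print(schema_of_json('{"c1": [0], "c2": [{"c3":0}]}'))
# struct<c1:array<bigint>,c2:array<struct<c3:bigint>>>
```

The real function additionally handles null fields, conflicting element types, and Spark's full type system, which this sketch ignores.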
extends UnaryExpression with String2StringExpression with CodegenFallback {

  private val jsonOptions = new JSONOptions(Map.empty, "UTC")
  private val jsonFactory = new JsonFactory()
It seems `jsonOptions.setJacksonOptions(factory)` is missing.
Thank you for pointing this out. I didn't know that I had to call that method.
DataType.fromDDL(ddlSchema.toString)
case e => throw new AnalysisException(
  "Schema should be specified in DDL format as a string literal" +
  s" or output of the schema_of_json function instead of $e")
minor nit: `schema_of_json` -> `exp.prettyName`
`exp` contains a schema or something like that. Maybe you mean `$e` -> `${e.prettyName}`?
Examples:
  > SELECT _FUNC_('[{"col":0}]');
   array<struct<col:int>>
""")
since

python/pyspark/sql/functions.py (outdated)
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(schema_of_json(df.value).alias("json")).collect()
[Row(json=u'struct<a:bigint>')]
>>> df.select(schema_of_json(lit('''{"a": 0}''')).alias("json")).collect()
minor nit: `'''{"a": 0}'''` -> `'{"a": 0}'`
 * @since 2.4.0
 */
def from_json(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {
  withExpr {new JsonToStructs(e.expr, schema.expr, options.asScala.toMap)}
nit: `{new` -> `{ new` (add a space after the brace), or use `withExpr(`
Maybe just `withExpr(new JsonToStructs(...))`?
Yup
}

/**
 * (Scala-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
Is it Scala specific or Java specific?
I am not sure about this note at all. Why should it be Java- or Scala-specific? I will change it to Java-specific to keep the same style as the other comments.
Yup. That's because Java users are more likely to use Java's collections rather than Scala's collections, which work awkwardly on the Java side.
Actually, can we remove this variant for now and add it back when it's requested? There is a concern about this file getting too long, and from_json already has so many variants.
Maybe we will convert the `functions` object to a package object `functions` and put the JSON-related methods into a separate file. This `functions` object is becoming really big.
Seems fine to me otherwise.
@@ -2189,11 +2189,16 @@ def from_json(col, schema, options={}):
    >>> df = spark.createDataFrame(data, ("key", "value"))
    >>> df.select(from_json(df.value, schema).alias("json")).collect()
    [Row(json=[Row(a=1)])]
    >>> schema = schema_of_json(lit('''{"a": 0}'''))
nit: `'''{"a": 0}'''` -> `'{"a": 0}'`
feel free to fix other examples above too
Do you mean for other functions too?
Nope, I mean the examples here in this function.
 * @group collection_funcs
 * @since 2.4.0
 */
def from_json(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {
Let me leave my last comment, #21686 (comment), in case it's missed.
We call the method from Python: https://github.com/apache/spark/pull/21686/files#diff-f5295f69bfbdbf6e161aed54057ea36dR2202
Do you really want to revert the changes for Python?
Ah, I see. I am fine then. Thanks.
python/pyspark/sql/functions.py (outdated)
:param col: string column in json format

>>> from pyspark.sql.types import *
>>> data = [(1, '''{"a": 1}''')]
ditto
LGTM otherwise
Test build #92575 has finished for PR 21686 at commit

Test build #92578 has finished for PR 21686 at commit

Test build #92580 has finished for PR 21686 at commit

Merged to master.
What changes were proposed in this pull request?

In the PR, I propose to add a new function, *schema_of_json()*, which infers the schema of a JSON string literal. The result of the function is a string containing a schema in DDL format.

One of the use cases is using *schema_of_json()* in combination with *from_json()*. Currently, *from_json()* requires a schema as a mandatory argument. The *schema_of_json()* function allows pointing at a JSON string as an example that has the same schema as the first argument of *from_json()*. For instance:

```sql
select from_json(json_column, schema_of_json('{"c1": [0], "c2": [{"c3":0}]}')) from json_table;
```

How was this patch tested?

Added new tests to `JsonFunctionsSuite` and `JsonExpressionsSuite`, and SQL tests to `json-functions.sql`.

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes apache#21686 from MaxGekk/infer_schema_json.
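The from_json/schema_of_json pairing described above can be sketched outside of Spark as well. The following is a minimal, self-contained Python illustration, not the Spark API: both helper functions below are toy stand-ins, and the null-on-mismatch behavior is only a rough analogue of Spark's handling of corrupt records.

```python
import json

def schema_of_json(example):
    """Toy inference: derive a DDL-like type string from one JSON example."""
    def infer(v):
        if isinstance(v, bool):
            return "boolean"
        if isinstance(v, int):
            return "bigint"
        if isinstance(v, float):
            return "double"
        if isinstance(v, list):
            return "array<" + (infer(v[0]) if v else "string") + ">"
        if isinstance(v, dict):
            return "struct<" + ",".join(k + ":" + infer(x) for k, x in v.items()) + ">"
        return "string"
    return infer(json.loads(example))

def from_json(column_values, schema):
    """Toy parser: keep rows whose inferred schema matches; yield None otherwise,
    loosely mimicking Spark's null result for corrupt or mismatching records."""
    rows = []
    for s in column_values:
        try:
            if schema_of_json(s) == schema:
                rows.append(json.loads(s))
                continue
        except ValueError:
            pass  # not valid JSON at all
        rows.append(None)
    return rows

# Infer the schema from one example, then reuse it to parse a whole column.
schema = schema_of_json('{"c1": [0], "c2": [{"c3":0}]}')
print(from_json(['{"c1": [1, 2], "c2": [{"c3": 5}]}', 'oops'], schema))
# [{'c1': [1, 2], 'c2': [{'c3': 5}]}, None]
```

This mirrors the intended workflow: one representative JSON document supplies the schema, so callers no longer have to hand-write the DDL string passed to from_json().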