[SPARK-33299][SQL] Support schema in JSON format by from_json/from_csv in Spark SQL and PySpark #30201

MaxGekk wants to merge 11 commits into apache:master from …

Conversation
Title changed: "…from_json/from_csv in all APIs" → "…from_json/from_csv across all APIs"
@HyukjinKwon @maropu Please review this PR.
Kubernetes integration test starting

Kubernetes integration test starting

Kubernetes integration test status failure

Kubernetes integration test status failure
We should probably update SparkR (in the collection functions doc) and the PySpark docs.
```scala
// Diff excerpt from the Scala from_json API (surrounding context elided):
//   fallbackParser = DataType.fromDDL)
//   from_json(e, dataType, options)

def from_json(e: Column, schema: String, options: Map[String, String]): Column = withExpr {
  new JsonToStructs(e.expr, lit(schema).expr, options)
}
```
Could you update the `ExpressionDescription` (usage and examples) of `CsvToStructs` and `JsonToStructs`?
Test build #130449 has finished for PR 30201 at commit …

Test build #130462 has finished for PR 30201 at commit …
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #130465 has finished for PR 30201 at commit …
I guess SparkR handles schema strings specially; it seems it doesn't support schemas in JSON format. @HyukjinKwon, correct? (See line 2522 in 3beab8d.) I am going to revert my comments in SparkR.
Title changed: "…from_json/from_csv across all APIs" → "…from_json/from_csv Spark SQL and PySpark"
Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #130470 has finished for PR 30201 at commit …
I was thinking more about an example of using JSON input for the schema, like the one you've added in …. But it doesn't seem like there's any user-visible change in Python behavior here after all, is there?
Yes, my bad.
Thanks @MaxGekk for catching this. Will take a look tomorrow.
Okay, I have thought about it for a while and had an offline discussion with @MaxGekk. It looks like we should just fix the doc in PySpark, and not support the JSON format in PySpark, SQL, and SparkR for now; we can keep it on the Scala side for legacy reasons. The reason is that I think the JSON format is sort of internal, and we should promote DDL-formatted strings instead. I haven't heard any complaints about JSON string support so far (surprisingly).
Please review #30226.
What changes were proposed in this pull request?
Move schema parsing from `from_json` in the Scala API to the common function `ExprUtils.evalTypeExpr`, which is used in the `JsonToStructs` and `CsvToStructs` expressions, and as a consequence in the SQL API, for instance.
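For illustration, here is a minimal standalone sketch of the parsing order this implies; `parseSchemaWithFallback` is a hypothetical name, not Spark's actual API (Spark does this inside `ExprUtils.evalTypeExpr`):

```scala
import org.apache.spark.sql.types.DataType

// Hypothetical sketch, not Spark's internal code: try the legacy JSON
// representation of the schema first, then fall back to a DDL string.
def parseSchemaWithFallback(schema: String): DataType = {
  try {
    DataType.fromJson(schema) // e.g. {"type":"struct","fields":[...]}
  } catch {
    case _: Exception => DataType.fromDDL(schema) // e.g. "a INT, b DOUBLE"
  }
}
```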
Why are the changes needed?

Currently, `from_json` has different behavior in the Scala API and in PySpark/SQL: the Scala API accepts a schema in JSON format, while PySpark and SQL do not. To improve the user experience with Spark SQL and PySpark, this PR proposes to unify the behavior of `from_json` and `from_csv`.
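For example (an illustrative spark-shell snippet; the DataFrame and sample values are made up):

```scala
// Assumes a spark-shell session, where `spark` is the active SparkSession.
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val df = Seq("""{"a":1}""").toDF("value")

// DDL-formatted schema string: accepted in all APIs.
df.select(from_json($"value", "a INT", Map.empty[String, String])).show()

// JSON-formatted schema string: before this PR, only the Scala API accepted it.
val jsonSchema =
  """{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true}]}"""
df.select(from_json($"value", jsonSchema, Map.empty[String, String])).show()
```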
Does this PR introduce any user-facing change?

Yes.
Before (in Spark 3.0):
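An illustrative sketch of the old behavior (the exact query and failure mode are assumptions, not the PR's original snippet):

```scala
// Illustrative: in Spark 3.0, SQL parsed the schema argument only as DDL,
// so a JSON-formatted schema string made the query fail with a parse error.
spark.sql(
  """SELECT from_json('{"a":1}',
    |  '{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true}]}')""".stripMargin
).show()
// => throws an exception: the schema string is not a valid DDL type
```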
After:
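Again illustrative, with the same assumed query:

```scala
// Illustrative: with this PR, the same JSON-formatted schema string is parsed
// successfully, so the query returns the struct instead of failing.
spark.sql(
  """SELECT from_json('{"a":1}',
    |  '{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true}]}')""".stripMargin
).show()
// => one row containing the struct {"a":1}
```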
How was this patch tested?
Added test cases to `csv-functions.sql` and `json-functions.sql`.
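A hypothetical example of the kind of `from_csv` query such golden-file tests exercise (not the actual test contents):

```scala
// Hypothetical, in the spirit of csv-functions.sql: from_csv should now also
// accept a JSON-formatted schema string in SQL.
spark.sql(
  """SELECT from_csv('1',
    |  '{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true}]}')""".stripMargin
).show()
```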