[SPARK-20009][SQL] Support DDL strings for defining schema in functions.from_json #17406
Conversation
// Since we added a user-friendly API in the DDL parser, we employ DDL formats for this case.
// To keep backward compatibility, we use `fromJson` first, and then try the new API.
def fromString(text: String): DataType = {
  try { fromJson(text) } catch { case _: Throwable => CatalystSqlParser.parseTableSchema(text) }
Throwable -> NonFatal
okay
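For context on why NonFatal is preferred here: it rethrows fatal errors instead of swallowing them. A minimal, self-contained sketch (the parser functions are placeholders, not Spark APIs):

import scala.util.control.NonFatal

// Placeholder parsers standing in for `fromJson` and `CatalystSqlParser.parseTableSchema`.
def parsePrimary(text: String): String =
  if (text.trim.startsWith("{")) s"json:$text" else throw new IllegalArgumentException("not JSON")
def parseFallback(text: String): String = s"ddl:$text"

def fromString(text: String): String =
  try parsePrimary(text) catch {
    // NonFatal matches ordinary exceptions but lets fatal ones
    // (OutOfMemoryError, StackOverflowError, InterruptedException, ...) propagate,
    // which is why it is preferred over catching Throwable.
    case NonFatal(_) => parseFallback(text)
  }

assert(fromString("a INT") == "ddl:a INT")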
// Until Spark 2.1, we use json strings for defining schemas in user-facing APIs.
// Since we added a user-friendly API in the DDL parser, we employ DDL formats for this case.
// To keep backward compatibility, we use `fromJson` first, and then try the new API.
def fromString(text: String): DataType = {
If this is a new API, why do we need to keep backward compatibility? For the JSON use case, there is fromJson, and I don't think you changed it.
In this PR, I just propose supporting the new DDL formats in the existing APIs that already use JSON formats for defining schemas (e.g., functions.from_json). So we need to support both JSON and DDL formats there. If I misunderstood you, could you correct me? Thanks for your comment!
I mean users should use fromString just for DDL formats. It is pretty weird that an API named from_json is used to parse both JSON and DDL formats. How about adding a new from_ddl or from_string?
Aha, I see. WDYT? cc: @gatorsmile
Sorry for interrupting. Actually, my point is a bit different. I thought DataType.fromString should take only the SQL type string, with proper documentation, unless we are going to deprecate DataType.fromJson for this purpose, and both cases should be handled separately, whether within functions.from_json or somewhere else.
I guess functions.from_json refers to the data itself being in JSON format, not the schema. So it is probably fine if we document this well.
To cut it short, my point was that DataType.fromString looks like it will be exposed as a separate functionality, not a private function, and therefore we should not fold another existing functionality into it.
DataType.fromString should only take the DDL format. DataType.fromJson should only take JSON. I don't think functions.from_json should accept both the DDL format and JSON like the current change does. Maybe we can add a new from_ddl or from_string using DataType.fromString, like from_json uses DataType.fromJson? That is basically my point here. :-)
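A sketch of what such a function could look like (hypothetical: from_string was proposed here but never added; the sketch assumes the existing from_json(Column, StructType, Map) overload and the StructType.fromDDL helper this PR eventually introduces):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

// Hypothetical `from_string`: parse a JSON column using a DDL-formatted schema string only.
def from_string(e: Column, ddlSchema: String, options: Map[String, String] = Map.empty): Column =
  from_json(e, StructType.fromDDL(ddlSchema), options)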
Yup, to cut my comment even shorter, for me, +1 only for:
DataType.fromString should only take the DDL format.
DataType.fromJson should only take JSON.
Oh, got it. I got confused by the name of from_json...
Right. It is used to parse a column of JSON data...
If we are going to support both the JSON schema string and the DDL format in from_json, shall we add a DataType.fromStringOrJson that does exactly what the current DataType.fromString does, and let DataType.fromString accept only the DDL format?
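A minimal sketch of that proposal (fromStringOrJson is just the name suggested above, not an existing API; CatalystSqlParser is Spark-internal):

import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// Suggested helper: try the legacy JSON schema string first, then fall back to the DDL parser.
def fromStringOrJson(text: String): DataType =
  try DataType.fromJson(text) catch {
    case NonFatal(_) => CatalystSqlParser.parseTableSchema(text)
  }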
Thanks for the valuable comments! Yeah, I basically agree with @HyukjinKwon's idea. Also, fromDdl is better than fromString to me, as @viirya suggested. I'll update soon.
Test build #75141 has finished for PR 17406 at commit.
Test build #75143 has finished for PR 17406 at commit.
@gatorsmile could you check?
Test build #75159 has finished for PR 17406 at commit.
@@ -103,6 +104,8 @@ object DataType {
def fromJson(json: String): DataType = parseDataType(parse(json))

def fromDdl(ddl: String): DataType = CatalystSqlParser.parseTableSchema(ddl)
fromDdl -> fromDDL
Also, please add comments above the function to explain the purpose of these functions.
} catch {
  case NonFatal(_) => DataType.fromDdl(schemaAsString)
}
Some(schema.asInstanceOf[StructType])
SimpleTextSource is just a fake source. What is the reason you made these changes?
Aha, I missed the point. Okay, I'll revert this. How about only focusing on the
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
// Until Spark 2.1, we use json strings for defining schemas here. Since we added a user-friendly
// API in the DDL parser, we employ DDL formats for this case. To keep backward compatibility,
// we use `fromJson` first, and then try the new API.
How about:
In Spark 2.1, the user-provided schema has to be in JSON format. Since Spark 2.2, the DDL format is also supported for the schema.
Please move the new description to @param schema?
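For example, the Scaladoc could carry the note roughly like this (a sketch of the wording, not the exact merged text; the method body is elided):

import org.apache.spark.sql.Column

/**
 * (Java-specific) Parses a column containing a JSON string into a `StructType`
 * with the specified schema.
 *
 * @param schema the schema to use when parsing the JSON string. In Spark 2.1, the
 *               user-provided schema has to be in JSON format. Since Spark 2.2, the
 *               DDL format is also supported.
 */
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = ???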
// Test DDL formats
test(s"from ddl - $dataType") {
  assert(DataType.fromDdl(s"a ${dataType.sql}") === new StructType().add("a", dataType))
}
How about defining a function like checkDataTypeFromJson? It would look more consistent.
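A sketch of the DDL-side counterpart that suggestion leads to (assuming the StructType.fromDDL this PR ends up with; strict equality works for these cases because DDL-parsed fields default to nullable):

import org.apache.spark.sql.types._

// Mirrors checkDataTypeFromJson: round-trips a type through its DDL form.
def checkDataTypeFromDDL(dataType: DataType): Unit = {
  val parsed = StructType.fromDDL(s"a ${dataType.sql}")
  val expected = new StructType().add("a", dataType)
  assert(parsed == expected, s"DDL round trip failed for ${dataType.sql}")
}

checkDataTypeFromDDL(IntegerType)
checkDataTypeFromDDL(StringType)

The strict equality comparison is revisited later in the review in favor of sameType.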
Also, please update the PR title and description. Thanks for working on it!
okay, I'll update soon! Thanks!
Test build #75196 has finished for PR 17406 at commit.
retest this please
oh, it seems we hit weird errors...
Just submitted a fix. NVM
Test build #75202 has finished for PR 17406 at commit.
Jenkins, retest this please.
Test build #75207 has finished for PR 17406 at commit.
The issue was fixed in a2ce0a2, so I'll retest.
Jenkins, retest this please.
Test build #75213 has finished for PR 17406 at commit.
checkDataTypeFromDDL(dataType)
}

// For some types, check the JSON format only, because those types do not support the DDL format.
Actually, except for NullType, all the following data types should be supported. The only issue is nullability: our DDL does not recognize something like
CREATE TABLE TAB (COL1 CHAR(3) NOT NULL)
Yeah, I think so. Should we file a JIRA for that?
In Spark SQL, we do not enforce nullability. It is like a hint for optimization. Not supporting such a syntax is reasonable.
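To make the nullability point concrete (a small check, assuming the StructType.fromDDL this PR adds): there is no NOT NULL modifier in this parser, so every parsed field comes back nullable.

import org.apache.spark.sql.types.StructType

val parsedSchema = StructType.fromDDL("col1 CHAR(3), col2 INT")
// No NOT NULL syntax is accepted, so all fields default to nullable = true.
assert(parsedSchema.fields.forall(_.nullable))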
checkDataTypeFromJson(ArrayType(StringType, false))

checkDataTypeFromText(MapType(IntegerType, StringType, true))
checkDataTypeFromJson(MapType(IntegerType, ArrayType(DoubleType), false))
You can keep both, but add an extra one for checkDataTypeFromJson when the nullability is false:
checkDataTypeFromText(MapType(IntegerType, ArrayType(DoubleType), true))
checkDataTypeFromJson(MapType(IntegerType, ArrayType(DoubleType), false))
okay
@@ -201,7 +216,7 @@ class DataTypeSuite extends SparkFunSuite {
  StructField("a", IntegerType, nullable = true),
  StructField("b", ArrayType(DoubleType), nullable = false),
  StructField("c", DoubleType, nullable = false, metadata)))
checkDataTypeJsonRepr(structType)
checkDataTypeFromJson(structType)
The same here
Test build #75231 has started for PR 17406 at commit.
Jenkins, retest this please.
Test build #75236 has finished for PR 17406 at commit.
@@ -103,6 +104,12 @@ object DataType {
def fromJson(json: String): DataType = parseDataType(parse(json))

/**
 * Creates DataType for a given DDL-formatted string, which is a comma separated list of field
 * definitions, e.g., a INT, b STRING.
Creates DataType -> Creates StructType, because parseTableSchema always returns StructType.
Thanks for the comments! Fixed.
 * Creates DataType for a given DDL-formatted string, which is a comma separated list of field
 * definitions, e.g., a INT, b STRING.
 */
def fromDDL(ddl: String): DataType = CatalystSqlParser.parseTableSchema(ddl)
Shall we use StructType as the return type?
Test build #75252 has finished for PR 17406 at commit.
retest this please
Also cc @cloud-fan @hvanhovell. If there are no further comments, maybe we can merge it to master tomorrow?
Test build #75296 has started for PR 17406 at commit.
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column =
  from_json(e, DataType.fromJson(schema), options)
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
  val dataType = try {
A little concern: won't the error message from parsing the JSON be shadowed?
That is fine, right? cc @cloud-fan
Yeah, I think it's fine.
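To spell out the concern: in a try/catch fallback, only the second failure surfaces. A sketch of an alternative that keeps both messages (illustrative only; the merged code keeps the simple fallback, and the parsers here are placeholders):

import scala.util.control.NonFatal

// Placeholder parsers; both fail so we can observe which error surfaces.
def parseJson(s: String): Int = throw new IllegalArgumentException(s"invalid JSON: $s")
def parseDDL(s: String): Int = throw new IllegalArgumentException(s"invalid DDL: $s")

def parseSchema(s: String): Int =
  try parseJson(s) catch {
    case NonFatal(jsonError) =>
      try parseDDL(s) catch {
        case NonFatal(ddlError) =>
          // Report both failures instead of shadowing the JSON error.
          throw new IllegalArgumentException(
            s"${ddlError.getMessage}; also failed as JSON: ${jsonError.getMessage}")
      }
  }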
 * Creates StructType for a given DDL-formatted string, which is a comma separated list of field
 * definitions, e.g., a INT, b STRING.
 */
def fromDDL(ddl: String): StructType = CatalystSqlParser.parseTableSchema(ddl)
Shall we move it to object StructType?
okay, I'll do that soon
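After the move, the entry point is StructType.fromDDL, the public helper this PR introduces; usage looks like:

import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Parse a comma-separated list of field definitions into a StructType.
val schema = StructType.fromDDL("a INT, b STRING")
assert(schema.fieldNames.toSeq == Seq("a", "b"))
assert(schema("a").dataType == IntegerType)
assert(schema("b").dataType == StringType)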
Ok. LGTM.
Test build #75303 has finished for PR 17406 at commit.
Test build #75304 has finished for PR 17406 at commit.
checkDataTypeJsonRepr(ArrayType(StringType, false))
checkDataTypeJsonRepr(MapType(IntegerType, StringType, true))
checkDataTypeJsonRepr(MapType(IntegerType, ArrayType(DoubleType), false))
def checkDataTypeFromDDL(dataType: DataType, ignoreNullability: Boolean = false): Unit = {
Why do we have the ignoreNullability parameter? DataType.sql doesn't contain nullability information, so we should not expect to retain nullability here. We should always compare the types with sameType.
Aha, I get the point. I'll fix it now.
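A sketch of the helper after that fix. Note that sameType is package-private to Spark, so a sketch like this only compiles inside Spark's own source tree, which is where DataTypeSuite lives:

import org.apache.spark.sql.types._

// `dataType.sql` carries no nullability, so compare modulo nullability via sameType.
def checkDataTypeFromDDL(dataType: DataType): Unit = {
  val parsed = StructType.fromDDL(s"a ${dataType.sql}")
  val expected = new StructType().add("a", dataType)
  assert(parsed.sameType(expected), s"DDL round trip failed for ${dataType.sql}")
}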
LGTM
Test build #75313 has finished for PR 17406 at commit.
@maropu Could you update the PR title? Then I can merge it once that is done. Thanks!
@gatorsmile okay, please. Thanks!
Could you update the PR title?
oh, my bad. Fixed.
Thanks! Merging to master.
val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}
I am wondering why parsing a schema from a string is implemented here and not inside the JsonToStructs expression. So calling from_json in SQL and via the Scala API has different behaviour, right? Did you do that intentionally?
I think you came to the wrong PR. The schema parsing started on the Scala side at maropu@fe33121#diff-b5e6d03d9c9afbfa925e039c48e31078608ea749c193e6af3087b79eb701bc7cR2877. I guess the reason is that the schema parameter was not intended to be used as an expression before. Now we take it in SQL as well.
I think I just missed it, so moving the code to a common place seems fine.
Thanks, @HyukjinKwon. Ah, I see, I totally forgot it...
I moved the parsing to a common place. Please take a look at #30201.
What changes were proposed in this pull request?
This PR added StructType.fromDDL to convert a DDL-format string into StructType for defining schemas in functions.from_json.
How was this patch tested?
Added tests in JsonFunctionsSuite.
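An end-to-end example of the merged behaviour (Spark 2.2+; the local session and column names are only for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().master("local[1]").appName("from_json-ddl").getOrCreate()
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")
// The schema string may now be in DDL format; JSON-formatted schemas still work.
val options = new java.util.HashMap[String, String]()
val parsed = df.select(from_json($"json", "a INT, b STRING", options).as("parsed"))
parsed.select("parsed.a", "parsed.b").show()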