[SPARK-20009][SQL] Support DDL strings for defining schema in functions.from_json #17406
Conversation
// Since we added a user-friendly API in the DDL parser, we employ DDL formats for this case.
// To keep backward compatibility, we use `fromJson` first, and then try the new API.
def fromString(text: String): DataType = {
  try { fromJson(text) } catch { case _: Throwable => CatalystSqlParser.parseTableSchema(text) }
Throwable -> NonFatal
okay
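For context on why NonFatal is preferred here: it rethrows fatal errors instead of swallowing them. A minimal, self-contained sketch (the parser functions are placeholders, not Spark APIs):

import scala.util.control.NonFatal

// Placeholder parsers standing in for `fromJson` and `CatalystSqlParser.parseTableSchema`.
def parsePrimary(text: String): String =
  if (text.trim.startsWith("{")) s"json:$text" else throw new IllegalArgumentException("not JSON")
def parseFallback(text: String): String = s"ddl:$text"

def fromString(text: String): String =
  try parsePrimary(text) catch {
    // NonFatal matches ordinary exceptions but lets fatal ones
    // (OutOfMemoryError, StackOverflowError, InterruptedException, ...) propagate,
    // which is why it is preferred over catching Throwable.
    case NonFatal(_) => parseFallback(text)
  }

assert(fromString("a INT") == "ddl:a INT")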
// Until Spark 2.1, we use json strings for defining schemas in user-facing APIs.
// Since we added a user-friendly API in the DDL parser, we employ DDL formats for this case.
// To keep backward compatibility, we use `fromJson` first, and then try the new API.
def fromString(text: String): DataType = {
If this is a new API, why do we need to keep backward compatibility? For the JSON use case, there is fromJson, and I don't think you changed it.
In this PR, I just propose supporting the new DDL formats in the existing APIs that already use JSON formats for defining schemas (e.g., functions.from_json). So we need to support both JSON and DDL formats there. If I misunderstood you, could you correct me? Thanks for your comment!
I mean users should use fromString just for DDL formats. It is pretty weird that an API named from_json is used to parse both JSON and DDL formats. How about adding a new from_ddl or from_string?
Aha, I see. WDYT? cc: @gatorsmile
Sorry for interrupting. Actually, my point is a bit different. I thought DataType.fromString should take only the SQL type string, with proper documentation, unless we are going to deprecate DataType.fromJson for this purpose, and both cases should be handled separately, whether within functions.from_json or somewhere else.
I guess functions.from_json refers to the data itself being in JSON format, not the schema. So it is probably fine if we document this well.
To cut it short, my point was that DataType.fromString looks like it will be exposed as a separate functionality, not a private function, and therefore we should not fold another existing functionality into it.
DataType.fromString should only take the DDL format. DataType.fromJson should only take JSON. I don't think functions.from_json should accept both the DDL format and JSON like the current change does. Maybe we can add a new from_ddl or from_string using DataType.fromString, like from_json uses DataType.fromJson? That is basically my point here. :-)
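A sketch of what such a function could look like (hypothetical: from_string was proposed here but never added; the sketch assumes the existing from_json(Column, StructType, Map) overload and the StructType.fromDDL helper this PR eventually introduces):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

// Hypothetical `from_string`: parse a JSON column using a DDL-formatted schema string only.
def from_string(e: Column, ddlSchema: String, options: Map[String, String] = Map.empty): Column =
  from_json(e, StructType.fromDDL(ddlSchema), options)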
Yup, to cut my comment even shorter, for me, +1 only for:
DataType.fromString should only take the DDL format.
DataType.fromJson should only take JSON.
Oh, got it. I got confused by the name of from_json...
Right. It is used to parse a column of JSON data...
If we are going to support both the JSON schema string and the DDL format in from_json, shall we add a DataType.fromStringOrJson that does exactly what the current DataType.fromString does, and let DataType.fromString accept only the DDL format?
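A minimal sketch of that proposal (fromStringOrJson is just the name suggested above, not an existing API; CatalystSqlParser is Spark-internal):

import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// Suggested helper: try the legacy JSON schema string first, then fall back to the DDL parser.
def fromStringOrJson(text: String): DataType =
  try DataType.fromJson(text) catch {
    case NonFatal(_) => CatalystSqlParser.parseTableSchema(text)
  }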
Thanks for the valuable comments! Yeah, I basically agree with @HyukjinKwon's idea. Also, fromDdl is better than fromString to me, as @viirya suggested. I'll update soon.
Test build #75141 has finished for PR 17406 at commit.
Test build #75143 has finished for PR 17406 at commit.
@gatorsmile could you check?
Test build #75159 has finished for PR 17406 at commit.
@@ -103,6 +104,8 @@ object DataType {
def fromJson(json: String): DataType = parseDataType(parse(json))

def fromDdl(ddl: String): DataType = CatalystSqlParser.parseTableSchema(ddl)
fromDdl -> fromDDL
Also, please add comments above the function to explain the purpose of these functions.
} catch {
  case NonFatal(_) => DataType.fromDdl(schemaAsString)
}
Some(schema.asInstanceOf[StructType])
SimpleTextSource is just a fake source. What is the reason you made these changes?
Aha, I missed the point. Okay, I'll revert this. How about only focusing on the
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
// Until Spark 2.1, we use json strings for defining schemas here. Since we added a user-friendly
// API in the DDL parser, we employ DDL formats for this case. To keep backward compatibility,
// we use `fromJson` first, and then try the new API.
How about:
In Spark 2.1, the user-provided schema has to be in JSON format. Since Spark 2.2, the DDL format is also supported for the schema.
Please move the new description to @param schema?
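For example, the Scaladoc could carry the note roughly like this (a sketch of the wording, not the exact merged text; the method body is elided):

import org.apache.spark.sql.Column

/**
 * (Java-specific) Parses a column containing a JSON string into a `StructType`
 * with the specified schema.
 *
 * @param schema the schema to use when parsing the JSON string. In Spark 2.1, the
 *               user-provided schema has to be in JSON format. Since Spark 2.2, the
 *               DDL format is also supported.
 */
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = ???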
// Test DDL formats
test(s"from ddl - $dataType") {
  assert(DataType.fromDdl(s"a ${dataType.sql}") === new StructType().add("a", dataType))
}
How about defining a function like checkDataTypeFromJson? It would look more consistent.
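A sketch of the DDL-side counterpart that suggestion leads to (assuming the StructType.fromDDL this PR ends up with; strict equality works for these cases because DDL-parsed fields default to nullable):

import org.apache.spark.sql.types._

// Mirrors checkDataTypeFromJson: round-trips a type through its DDL form.
def checkDataTypeFromDDL(dataType: DataType): Unit = {
  val parsed = StructType.fromDDL(s"a ${dataType.sql}")
  val expected = new StructType().add("a", dataType)
  assert(parsed == expected, s"DDL round trip failed for ${dataType.sql}")
}

checkDataTypeFromDDL(IntegerType)
checkDataTypeFromDDL(StringType)

The strict equality comparison is revisited later in the review in favor of sameType.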
Also, please update the PR title and description. Thanks for working on it!
okay, I'll update soon! Thanks!
Test build #75196 has finished for PR 17406 at commit.
retest this please
oh, it seems we hit weird errors...
Just submitted a fix. NVM
Test build #75202 has finished for PR 17406 at commit.
Jenkins, retest this please.
Test build #75207 has finished for PR 17406 at commit.
The issue was fixed in a2ce0a2, so I'll retest.
Jenkins, retest this please.
Test build #75213 has finished for PR 17406 at commit.
checkDataTypeFromDDL(dataType)
}

// For some types, check the JSON format only, because those types do not support the DDL format.
Actually, except for NullType, all the following data types should be supported. The only issue is nullability: our DDL does not recognize something like
CREATE TABLE TAB (COL1 CHAR(3) NOT NULL)
Yeah, I think so. Should we file a JIRA for that?
In Spark SQL, we do not enforce nullability. It is like a hint for optimization. Not supporting such a syntax is reasonable.
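To make the nullability point concrete (a small check, assuming the StructType.fromDDL this PR adds): there is no NOT NULL modifier in this parser, so every parsed field comes back nullable.

import org.apache.spark.sql.types.StructType

val parsedSchema = StructType.fromDDL("col1 CHAR(3), col2 INT")
// No NOT NULL syntax is accepted, so all fields default to nullable = true.
assert(parsedSchema.fields.forall(_.nullable))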
checkDataTypeFromJson(ArrayType(StringType, false))

checkDataTypeFromText(MapType(IntegerType, StringType, true))
checkDataTypeFromJson(MapType(IntegerType, ArrayType(DoubleType), false))
You can keep both, but add an extra one for checkDataTypeFromJson when the nullability is false:
checkDataTypeFromText(MapType(IntegerType, ArrayType(DoubleType), true))
checkDataTypeFromJson(MapType(IntegerType, ArrayType(DoubleType), false))
okay
@@ -201,7 +216,7 @@ class DataTypeSuite extends SparkFunSuite {
  StructField("a", IntegerType, nullable = true),
  StructField("b", ArrayType(DoubleType), nullable = false),
  StructField("c", DoubleType, nullable = false, metadata)))
checkDataTypeJsonRepr(structType)
checkDataTypeFromJson(structType)
The same here
Test build #75231 has started for PR 17406 at commit.
Jenkins, retest this please.
Test build #75236 has finished for PR 17406 at commit.
@@ -103,6 +104,12 @@ object DataType {
def fromJson(json: String): DataType = parseDataType(parse(json))

/**
 * Creates DataType for a given DDL-formatted string, which is a comma separated list of field
 * definitions, e.g., a INT, b STRING.
Creates DataType -> Creates StructType, because parseTableSchema always returns StructType.
Thanks for the comments! Fixed.
 * Creates DataType for a given DDL-formatted string, which is a comma separated list of field
 * definitions, e.g., a INT, b STRING.
 */
def fromDDL(ddl: String): DataType = CatalystSqlParser.parseTableSchema(ddl)
Shall we use StructType as the return type?
Test build #75252 has finished for PR 17406 at commit.
retest this please
Also cc @cloud-fan @hvanhovell. If there are no further comments, maybe we can merge it to master tomorrow?
Test build #75296 has started for PR 17406 at commit.
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column =
  from_json(e, DataType.fromJson(schema), options)
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
  val dataType = try {
A little concern: won't the error message from parsing the JSON be shadowed?
That is fine, right? cc @cloud-fan
Yeah, I think it's fine.
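To spell out the concern: in a try/catch fallback, only the second failure surfaces. A sketch of an alternative that keeps both messages (illustrative only; the merged code keeps the simple fallback, and the parsers here are placeholders):

import scala.util.control.NonFatal

// Placeholder parsers; both fail so we can observe which error surfaces.
def parseJson(s: String): Int = throw new IllegalArgumentException(s"invalid JSON: $s")
def parseDDL(s: String): Int = throw new IllegalArgumentException(s"invalid DDL: $s")

def parseSchema(s: String): Int =
  try parseJson(s) catch {
    case NonFatal(jsonError) =>
      try parseDDL(s) catch {
        case NonFatal(ddlError) =>
          // Report both failures instead of shadowing the JSON error.
          throw new IllegalArgumentException(
            s"${ddlError.getMessage}; also failed as JSON: ${jsonError.getMessage}")
      }
  }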
 * Creates StructType for a given DDL-formatted string, which is a comma separated list of field
 * definitions, e.g., a INT, b STRING.
 */
def fromDDL(ddl: String): StructType = CatalystSqlParser.parseTableSchema(ddl)
Shall we move it to object StructType?
okay, I'll do that soon
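After the move, the entry point is StructType.fromDDL, the public helper this PR introduces; usage looks like:

import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Parse a comma-separated list of field definitions into a StructType.
val schema = StructType.fromDDL("a INT, b STRING")
assert(schema.fieldNames.toSeq == Seq("a", "b"))
assert(schema("a").dataType == IntegerType)
assert(schema("b").dataType == StringType)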
Ok. LGTM.
Test build #75303 has finished for PR 17406 at commit.
Test build #75304 has finished for PR 17406 at commit.
checkDataTypeJsonRepr(ArrayType(StringType, false))
checkDataTypeJsonRepr(MapType(IntegerType, StringType, true))
checkDataTypeJsonRepr(MapType(IntegerType, ArrayType(DoubleType), false))
def checkDataTypeFromDDL(dataType: DataType, ignoreNullability: Boolean = false): Unit = {
Why do we have the ignoreNullability parameter? DataType.sql doesn't contain nullability information, so we should not expect to retain nullability here. We should always compare the types with sameType.
Aha, I get the point. I'll fix it now.
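A sketch of the helper after that fix. Note that sameType is package-private to Spark, so a sketch like this only compiles inside Spark's own source tree, which is where DataTypeSuite lives:

import org.apache.spark.sql.types._

// `dataType.sql` carries no nullability, so compare modulo nullability via sameType.
def checkDataTypeFromDDL(dataType: DataType): Unit = {
  val parsed = StructType.fromDDL(s"a ${dataType.sql}")
  val expected = new StructType().add("a", dataType)
  assert(parsed.sameType(expected), s"DDL round trip failed for ${dataType.sql}")
}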
LGTM
Test build #75313 has finished for PR 17406 at commit.
@maropu Could you update the PR title? Then I can merge it once that is done. Thanks!
@gatorsmile okay, please. Thanks!
Could you update the PR title?
oh, my bad. Fixed.
Thanks! Merging to master.
val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}
I am wondering why parsing a schema from a string is implemented here and not inside the JsonToStructs expression. So calling from_json in SQL and via the Scala API has different behaviour, right? Did you do that intentionally?
I think you came to the wrong PR. The schema parsing started on the Scala side at maropu@fe33121#diff-b5e6d03d9c9afbfa925e039c48e31078608ea749c193e6af3087b79eb701bc7cR2877. I guess the reason is that the schema parameter was not intended to be used as an expression before. Now we take it in SQL as well.
I think I just missed it, so moving the code to a common place seems fine.
Thanks, @HyukjinKwon. Ah, I see, I totally forgot it...
I moved the parsing to a common place. Please take a look at #30201.
What changes were proposed in this pull request?
This PR added StructType.fromDDL to convert a DDL-format string into StructType for defining schemas in functions.from_json.
How was this patch tested?
Added tests in JsonFunctionsSuite.
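An end-to-end example of the merged behaviour (Spark 2.2+; the local session and column names are only for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().master("local[1]").appName("from_json-ddl").getOrCreate()
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")
// The schema string may now be in DDL format; JSON-formatted schemas still work.
val options = new java.util.HashMap[String, String]()
val parsed = df.select(from_json($"json", "a INT, b STRING", options).as("parsed"))
parsed.select("parsed.a", "parsed.b").show()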